BACKGROUND

The described technology relates generally to accessing data and
particularly to accessing data from data sources with diverse formats.

Large organizations may have their digital data stored in various data 
stores, such as databases and file systems, in diverse and incompatible formats. 
Different groups within the large organizations may have created their own data 
stores to meet the needs of the group. Each group would typically select its own
type of data storage system and format to meet its particular needs. Traditionally,
these data stores were created independently of any other data stores within the 
organization. As a result, the various data stores of an organization often 
contained duplicate and inconsistent data. 

Recently, these large organizations have adopted standards such as the
extensible markup language ("XML") for representing data in a uniform format.
The use of XML by each group within an organization increases the compatibility
of the data stores. It is, however, difficult for an organization to provide an XML
interface to each of its existing data stores. The organization would need to
expend considerable resources to provide a mapping between its existing data
stores or other sources of data and the XML formats.

It would be desirable to have a system that would facilitate the integration
of data stores with incompatible formats.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 illustrates the schema of the XML document of Table 4.

Figure 2 represents a JoinIn graph (JIG) for the match expression of Table 8.

Figure 3 is a block diagram illustrating the overall organization of an 
execution program generated by the data integration engine. 

Figure 4 is a block diagram illustrating the function to generate an 
execution program. 

Figure 5 is a flow diagram illustrating processing of the generate extract 
program function in one embodiment. 

Figure 6 is a flow diagram illustrating the processing of the generate extract 
plan function in one embodiment. 

Figure 7 is a flow diagram illustrating processing of the match expression 
function in one embodiment. 

Figure 8 is a flow diagram illustrating the processing of the create JoinIn
graph function in one embodiment.

Figure 9 is a flow diagram illustrating processing of the generate JoinIn
graph function in one embodiment.

Figure 10 illustrates the tables of the data store. 

Figure 11 illustrates the results of the sorted outer union for the tables of
Figure 10.

Figure 12 illustrates the SQL query for each of the tables of Figure 10 used
to generate the sorted outer union.

Figure 13 is a flow diagram illustrating the processing of generating a
sorted outer union.

Figure 14 is a flow diagram illustrating processing of a generate SQL query
function in one embodiment.

Figure 15 is a block diagram illustrating an extract program. 

Figure 16 is a flow diagram that illustrates code of a join node of an extract 
program in one embodiment. 

Figure 17 illustrates the output of the nodes of the extraction plan of Figure 15.

Figure 18 illustrates a final NCR structure. 
Figure 19 illustrates the Correspondence Tree. 

DETAILED DESCRIPTION 

A method and system for providing data integration of multiple data stores 
with diverse formats is provided. In one embodiment, the data integration engine 
accepts queries using a standard query language such as XML-QL, executes those 
queries against the multiple data stores, and returns the results. The data stores 
may include relational databases, hierarchical databases, file systems, application 
data available via APIs, and so on. A query may reference data that resides in 
different data stores. The data integration engine allows operations such as joins 
across multiple data stores. In one embodiment, the data integration engine uses 
XML as the data model in which the data from the various data stores is 
represented. The data integration engine processes a query by parsing the query
into an internal representation, compiling and optimizing the internal
representation into a physical execution representation, and then executing the
execution representation. By providing a uniform data model, the data
integration engine allows access to data stores in diverse formats.

In one embodiment, the data integration engine executes a query on a data
store by first providing a mapping of the data store format into an XML format.
The query for the data store is based on the XML format. The data integration engine,
upon receiving a query, generates a native query for the data store from the
received query using the provided mapping. The data integration engine then
executes the native query to generate data in a native format needed to generate 
the results of the received query. The data integration engine then converts the 
data in the native format into data in a format referred to as nested conditional 
relations ("NCR"). The data integration engine then applies various operators 
(e.g., joins and unions) to the data in NCR format to generate the query results in 
an NCR format. The data integration engine then converts the results in the NCR 
format into an XML format. In this way, the integration engine can provide access 
to various data sources in different formats. 

A nested conditional relation is a table in which each row may have a 
different schema and each column is either a primitive type or a nested NCR. The 
schema of each row in an NCR is indicated by a tag, which can be considered to 
be the zero column of the row. For example, certain rows of the table may 
represent employees of a company and have columns named "first name," "last 
name," "phone number," and so on. Other rows in the table may represent 
departments within the company and have columns named "department name," 
"department head," and so on. The tag for a row indicates whether the row is an 
employee or a department row. A column for a certain type of row may itself 
contain a nested conditional relation. For example, an employee row may include 
a column named "skills" that contains a table with sub-rows containing 
information relating to computer skills and accounting skills of the employee. The 
table may itself be a nested conditional relation in that each sub-row may include a 
tag indicating whether the row represents a computer skill or an accounting skill. 
The nesting of nested conditional relations may occur to an arbitrary level. The
NCR format is described below in detail.
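
The structure described above can be sketched in a few lines of Python. This is only an illustrative model, not the engine's actual NCR implementation: the class names `Row` and `NCR`, the helper `depth`, and the skill values are hypothetical. Each row carries a tag (the "zero column") that selects its schema, and any column may hold either a primitive value or another NCR.

```python
from dataclasses import dataclass, field
from typing import Union

# A column value is either a primitive or a nested NCR (hypothetical model).
Value = Union[str, int, float, None, "NCR"]


@dataclass
class Row:
    tag: str                   # the "zero column": names this row's schema
    columns: dict[str, Value]  # column name -> primitive or nested NCR


@dataclass
class NCR:
    rows: list[Row] = field(default_factory=list)

    def with_tag(self, tag: str) -> list[Row]:
        """Return the rows whose tag selects the given schema."""
        return [r for r in self.rows if r.tag == tag]


def depth(ncr: NCR) -> int:
    """Nesting depth: 1 for a flat NCR, more when columns hold NCRs."""
    nested = [depth(v) for r in ncr.rows for v in r.columns.values()
              if isinstance(v, NCR)]
    return 1 + max(nested, default=0)


# Sub-rows of the "skills" column; the values are made up for illustration.
skills = NCR([
    Row("computer", {"language": "SQL"}),
    Row("accounting", {"certification": "CPA"}),
])

# Employee and department rows share one table but have different schemas.
company = NCR([
    Row("employee", {"first name": "Bobby", "last name": "Darrows",
                     "skills": skills}),
    Row("department", {"department name": "Finance",
                       "department head": "E1247"}),
])
```

Here `company.with_tag("employee")` selects only the employee rows, and `depth(company)` reports two levels of nesting because the "skills" column itself holds an NCR.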

The following example illustrates a data store, a mapping for the data store,
a query, an LMatch representation for the query, a JoinIn graph for the query, and
an SQL query used to retrieve the data from the data source. Tables 1-3 illustrate
an example of data that is stored in a data store such as a relational database. The
relational database contains three tables: the DEPARTMENTS table, the EMPLOYEES
table, and the BUILDINGSDOCS table.



TABLE 1 DEPARTMENTS

Name         Contact
Finance      E1247
Engineering  E3214



TABLE 2 EMPLOYEES

ID     Fname      Lname     Dept         Bldg  Office  Manager
E0764  Bobby      Darrows   Finance      B     102     E1247
E0334  Alice      LeGlass   Finance      B     103     E1247
E1247  David      Winston   Finance      B     110     NULL
E3214  David      McKinzie  Engineering  L     NULL    E1153
E0868  Misha      Niev      Engineering  L     15      E1153
E0012  David      Herford   Engineering  M     332     E1153
E1153  Charlotte  Burton    Engineering  M     330     E0124
E0124  David      Wong      Engineering  L     12      NULL



TABLE 3 BUILDINGSDOCS

Building  Office  Phone  MaintContact
B         102     x1102  E0764
B         103     x1103  E0764
B         110     x1110  E0764
L         lobby   x0001  E3214
L         12      x0120  E3214
L         15      x0150  E3214
M         330     x2330  E3214
M         332     x2332  E3214



The DEPARTMENTS table contains one row for each department of an 
organization. As illustrated by Table 1, the organization has a finance and an 
engineering department. The DEPARTMENTS table contains two columns: 
name and contact. The name column contains the name of the department, and the 
contact column contains the employee identifier of the contact person for the 
department. For example, the first row of the table indicates that the department is
"Finance" and that the contact employee is "E1247." The EMPLOYEES table
contains a row for each employee in the organization. Each row includes seven
columns: ID, Fname, Lname, Dept, Bldg, Office, and Manager. The ID column
uniquely identifies the employee, the Fname column contains the first name of the
employee, the Lname column contains the last name of the employee, the Dept
column identifies the employee's department, the Bldg column identifies the
building in which the employee is located, the Office column identifies the
employee's office within the building, and the Manager column identifies the
employee's manager. The Dept column contains one of the values from the Name
column of the DEPARTMENTS table. The BUILDINGSDOCS table contains a
row for each office within each building of the organization. The
BUILDINGSDOCS table contains four columns: Building, Office, Phone, and
MaintContact. The Building column identifies a building, the Office column
identifies an office within the building, the Phone column contains the phone
number associated with that office, and the MaintContact column identifies the
employee who is the maintenance contact for the office. The combination of the
Building and Office columns uniquely identifies each row. The Bldg and Office
columns of the EMPLOYEES table identify a row within the
BUILDINGSDOCS table.

Table 4 is an example of data stored as an XML document. 

TABLE 4

<deptlist>
  <dept name="Finance">
    <employee>
      <name><first>Bobby</first><last>Darrows</last></name>
      <office phone="x1102"/>
    </employee>
    <employee>
      <name><first>Alice</first><last>LeGlass</last></name>
      <office phone="x1103"/>
    </employee>
  </dept>
  <dept name="Engineering">
    <employee>
      <name><first>David</first><last>McKinzie</last></name>
    </employee>
    <employee>
      <name><first>Misha</first><last>Niev</last></name>
      <office phone="x0150"/>
    </employee>
  </dept>
</deptlist>



The XML document includes the root element <deptlist>, which contains a
<dept> element corresponding to each department within an organization. Each
<dept> element has a name attribute and contains an <employee> element for
each employee within the department. Each <employee> element contains a
<name> element and optionally an <office> element. The <name> element
includes a <first> element and a <last> element. The <office> element includes a
phone attribute. The schema of an XML document may be represented by an
XML document type definition ("DTD") of the document. Figure 1 illustrates the
schema of this XML document. As this figure illustrates, the schema is specified
as a tree-like hierarchy with the nodes of the tree having parent-child relationships.
For example, node 104 is the parent of nodes 105 and 108, which are children of
node 104. Node 101 corresponds to the <deptlist> element and has one child node
102, which corresponds to the <dept> element. Node 102 has two child nodes,
103 and 104. Node 103 corresponds to the name attribute of the <dept> element,
and node 104 corresponds to the <employee> element. Node 104 has two child
nodes, 105 and 108. Node 105 corresponds to the <name> element and has two
child nodes, 106 and 107. Node 106 corresponds to the <first> element, and node
107 corresponds to the <last> element. Node 108 corresponds to the <office>
element and has one child node 109, which corresponds to the phone attribute.
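
As a rough illustration, the schema tree of Figure 1 can be written as nested tuples and walked to enumerate its parent-child relationships and leaf nodes. The tuple layout and the function names `parent_child_pairs` and `leaves` are assumptions made for this sketch; only the node numbers and element names come from the description above.

```python
# Schema of Figure 1 as nested (kind, name, children) tuples.
# "E" marks an element and "A" an attribute; node numbers follow the figure.
SCHEMA = ("E", "deptlist", [                 # node 101
    ("E", "dept", [                          # node 102
        ("A", "name", []),                   # node 103
        ("E", "employee", [                  # node 104
            ("E", "name", [                  # node 105
                ("E", "first", []),          # node 106
                ("E", "last", []),           # node 107
            ]),
            ("E", "office", [                # node 108
                ("A", "phone", []),          # node 109
            ]),
        ]),
    ]),
])


def parent_child_pairs(node):
    """Yield (parent name, child kind, child name) for every edge of the tree."""
    _, name, children = node
    for child in children:
        yield (name, child[0], child[1])
        yield from parent_child_pairs(child)


def leaves(node):
    """Yield the names of nodes that have no children."""
    _, name, children = node
    if not children:
        yield name
    for child in children:
        yield from leaves(child)


pairs = list(parent_child_pairs(SCHEMA))
for parent, kind, child in pairs:
    print(f"{parent} -> {child} ({kind})")
```

Walking the nine nodes yields eight parent-child pairs, one per edge of the tree, and four leaves (nodes 103, 106, 107, and 109).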

The mapping technique is particularly useful in situations where a legacy
database, such as the example database of Tables 1-3, is to be accessed using
queries designed for XML data, such as the example of Table 4. The XML
schema may be previously defined, and many different applications for accessing
data based on that XML schema may have also been defined. For example, one
such application may be a query of the data. An example query for semi-structured
data may be an XSL transform that is designed to input data in XML
format and output a subset of the data in XML format. For example, a query for
the database of Tables 1-3 may be a request to list the ID of each employee in the
"Finance" department. The subset of the data that is output corresponds to the
results of the query represented by the XSL transform. One skilled in the art
would appreciate that queries can be represented in other formats such as XML-QL.
When a legacy database is to be accessed, the data is not stored using XML
format. Thus, in one embodiment, a query system inputs a semi-structured query
and uses a mapping table to generate a structured query, such as an SQL query,
that is appropriate for accessing the legacy database. The mapping technique for
generating that mapping table is described below.

Table 5 is a portion of the mapping table generated in accordance with the
mapping technique that maps the XML schema of Table 4 to the legacy database
of Tables 1-3.

TABLE 5

Row  ParentName  A/E  ChildName  Table          PKey               CKey
1    deptlist    E    dept       DEPARTMENTS    (none)             Name
2    dept        A    name       DEPARTMENTS    Name               Name
3    dept        E    employee   EMPLOYEES      Dept               ID
4    employee    E    name       EMPLOYEES      ID                 ID
5    name        E    first      EMPLOYEES      ID                 Fname
6    name        E    last       EMPLOYEES      ID                 Lname
7    employee    E    office     EMPLOYEES      ID                 {Bldg,Office}
8    office      A    phone      BUILDINGSDOCS  {Building,Office}  phone



The mapping table contains one row for each parent-child relationship of
the XML schema. The mapping is further described in U.S. Patent Application
entitled "Method and Apparatus for Storing Semi-Structured Data in a Structured
Manner." As shown in Figure 1, the XML schema defines eight parent-child
relationships, such as the relationship between node 102 and node 104. Thus, the
mapping table contains eight rows. Each row uniquely identifies a parent-child
relationship using the ParentName and ChildName columns. For example, the
parent-child relationship of node 102 and node 104 is represented by row 3 as
indicated by the ParentName of "dept" and the ChildName of "employee." Each
row maps the parent-child relationship to the table in the legacy database that
corresponds to that relationship. In the example of row 3, the Table column
indicates that the "dept-employee" relationship maps to the EMPLOYEES table.
The query system could use only the ParentName, ChildName, and Table columns
of the mapping table to generate a structured query from a semi-structured query.
For example, if the legacy database had used the same column names as defined
by the elements of the XML schema (e.g., "employee" rather than "ID"), then only
these three columns would be needed to generate the structured query. For
example, if the semi-structured query requested an identifier of all employees
within the finance department and the EMPLOYEES table contained an
"employee" column rather than an "ID" column, then the query system could
generate a structured query from the semi-structured query using only these
three columns. In the more general case where the columns of the legacy database are
arbitrarily named, the mapping table includes a parent key column ("PKey") and a
child key column ("CKey"). The parent key column contains the name of the
column that identifies the parent of the parent-child relationship. The child key
column contains the name of the column that identifies the child of the
parent-child relationship. For example, in row 3, the parent is identified by the "Dept"
column and the child is identified by the "ID" column in the EMPLOYEES table.
Thus, to generate the structured query to retrieve the ID of each employee within the
finance department, a query with a where clause of
EMPLOYEES.Dept = "Finance" would be used. Table 5 also includes a column
named "A/E" to indicate whether the row corresponds to an element within the
semi-structured data or an attribute of an element within the semi-structured data. As
illustrated by rows 7 and 8, some of the parent and child keys actually consist of
multiple columns that uniquely identify a row in the corresponding table. For
example, the rows of the BUILDINGSDOCS table are uniquely identified by a
combination of the Building and Office columns.

The query system maps the selections within the semi-structured query to
selections within a structured query. The following illustrates the basic format of
that mapping when the structured query is in an SQL format.

SELECT {TABLE}.{CKEY}
FROM {TABLE}
WHERE {TABLE}.{PKEY} = pkey

The TABLE, CKEY, and PKEY parameters are replaced by the
corresponding values from the row in the mapping table for the parent-child
relationship specified by the selection. In other words, this query will find all the
children given the key for the parent. The following illustrates the format of the
mapping when the query represents the identification of the ID of all employees
within the finance department.

SELECT EMPLOYEES.ID
FROM EMPLOYEES
WHERE EMPLOYEES.Dept = "Finance"

The query system also allows chaining of keys to effectively navigate
through the hierarchy defined by the semi-structured data. The query system uses
the join concept of relational databases to effect this chaining of keys. The
following illustrates chaining:

SELECT {TABLE2}.{CKEY2}
FROM {TABLE1}, {TABLE2}
WHERE {TABLE1}.{PKEY1} = pkey AND {TABLE1}.{CKEY1} = {TABLE2}.{PKEY2}

The TABLE1, PKEY1, and CKEY1 parameters are derived from the first
parent-child relationship in the chain, and the TABLE2, PKEY2, and CKEY2
parameters are derived from the second parent-child relationship in the chain. The
child key associated with the first parent-child relationship matches the parent key
associated with the second parent-child relationship. The following is an example
of the chaining to identify the building for the employees of the finance
department.

SELECT BUILDINGSDOCS.Building
FROM EMPLOYEES, BUILDINGSDOCS
WHERE EMPLOYEES.Dept = "Finance" AND
EMPLOYEES.Bldg = BUILDINGSDOCS.Building AND
EMPLOYEES.Office = BUILDINGSDOCS.Office
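
The two templates above can be filled in mechanically. The following is a minimal sketch, assuming the table and key names have already been looked up in the mapping table. The function names `children_query`, `chained_query`, and `quote` are hypothetical. The sketch emits string literals in single quotes, the standard SQL form, rather than the double quotes shown in the examples above.

```python
def quote(value: str) -> str:
    """Render an SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def children_query(table: str, pkey: str, ckey: str, pkey_value: str) -> str:
    """Basic template: select the child key given a value for the parent key."""
    return (f"SELECT {table}.{ckey} FROM {table} "
            f"WHERE {table}.{pkey} = {quote(pkey_value)}")


def chained_query(table1, pkey1, ckey1, table2, pkey2, ckey2, pkey_value):
    """Chaining template: CKEY1 of the first relationship joins PKEY2 of the second.

    ckey1 and pkey2 are lists so that multi-column keys such as
    {Bldg,Office} and {Building,Office} pair up column by column.
    """
    conditions = [f"{table1}.{pkey1} = {quote(pkey_value)}"]
    conditions += [f"{table1}.{c1} = {table2}.{p2}"
                   for c1, p2 in zip(ckey1, pkey2)]
    return (f"SELECT {table2}.{ckey2} FROM {table1}, {table2} "
            f"WHERE {' AND '.join(conditions)}")


# IDs of the employees in the finance department.
simple = children_query("EMPLOYEES", "Dept", "ID", "Finance")

# Buildings of the employees in the finance department.
chained = chained_query("EMPLOYEES", "Dept", ["Bldg", "Office"],
                        "BUILDINGSDOCS", ["Building", "Office"], "Building",
                        "Finance")
print(simple)
print(chained)
```

Passing the multi-column keys as lists keeps the chaining template unchanged when a key spans several columns, as in rows 7 and 8 of Table 5.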

In one embodiment, the mapping table also contains value rows
corresponding to each leaf node, that is, a node that is not a parent node. The leaf
nodes of Figure 1 are nodes 103, 106, 107, and 109. In one embodiment, each
value row identifies an XML element or attribute, the table in the legacy database
that contains that element or attribute, and the name of the column in the table that
contains the value for that element or attribute. Table 6 illustrates the four value rows for the
mapping associated with Tables 1-3 and Table 4.



TABLE 6

Row  A/E  Name   Table          Key    Value
9    A    name   DEPARTMENTS    Name   Name
10   E    first  EMPLOYEES      Fname  FName
11   E    last   EMPLOYEES      Lname  LName
12   A    phone  BUILDINGSDOCS  Phone  Phone



The "A/E" column identifies whether the row is an attribute or element; the
"Name" column identifies the name of the element or attribute; the "Table"
column identifies the legacy table; the "Key" column identifies the key for that
table; and the "Value" column identifies the name of the column where the value
is stored.

Table 7 illustrates a query that is to be applied to the data of Tables 1-3.
The query requests the first and last names and phone number of each
employee in the engineering department.

TABLE 7

WHERE
  <deptlist>
    <dept name="Engineering">
      <employee>
        <name><first>$first</first><last>$last</last></name>
        <office phone="$ph"/>
      </employee>
    </dept>
  </deptlist>
CONSTRUCT
  <employee><name>$last, $first</name><phone>$ph</phone></employee>

The data integration engine generates a "match expression" for a logical 
match operation ("LMatch") for the query when compiling the query. The logical 
match operation supports operations for performing XML navigation. The match 
expression defines a tree of navigations. Each node of the tree indicates a 
navigation type (e.g., child, parent, or sibling), a navigation condition (e.g., a
condition on the name of the child), whether the navigation is required, whether
there should be a binding to the target of the navigation (i.e., a value returned with
the specified name), and whether the result should be nested.

Table 8 illustrates a match expression for the XML of Table 4 for the query
of Table 7. Each row of Table 8 represents a different navigation path. For
example, the first row represents a navigation path from the root deptlist
element to its child dept element and then to the name attribute of
the dept element. The remaining rows represent different branches of the tree.
For example, the second row represents the branch of root(deptlist), child(dept),
child(employee), child(name), and child(first). The symbols prefixed with "$"
represent bindings.



TABLE 8

root(deptlist)  child(dept)  child(name, $auto1)
                             child(employee)      child(name)    child(first, $first)
                                                                 child(last, $last)
                                                  child(office)  child(phone, $ph)
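
A match expression of this shape can be modeled as a small tree of navigation nodes. The sketch below is illustrative only: the class name `Nav`, its field names, and the default values for `required` and `nested` are assumptions, since the text names these properties but not their representation. The tree literal reproduces Table 8, and `paths` lists one navigation path per branch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Nav:
    kind: str                      # navigation type: "root", "child", "parent", "sibling"
    name: str                      # navigation condition: required name of the target
    binding: Optional[str] = None  # variable bound to the target, e.g. "$first"
    required: bool = True          # whether the navigation must succeed (assumed default)
    nested: bool = False           # whether the result should be nested (assumed default)
    children: list["Nav"] = field(default_factory=list)

    def label(self) -> str:
        args = self.name if self.binding is None else f"{self.name}, {self.binding}"
        return f"{self.kind}({args})"


def paths(node: Nav, prefix=()):
    """Yield one root-to-leaf navigation path per branch of the tree."""
    here = prefix + (node.label(),)
    if not node.children:
        yield here
    for child in node.children:
        yield from paths(child, here)


# The match expression of Table 8.
MATCH = Nav("root", "deptlist", children=[
    Nav("child", "dept", children=[
        Nav("child", "name", "$auto1"),
        Nav("child", "employee", children=[
            Nav("child", "name", children=[
                Nav("child", "first", "$first"),
                Nav("child", "last", "$last"),
            ]),
            Nav("child", "office", children=[
                Nav("child", "phone", "$ph"),
            ]),
        ]),
    ]),
])

for path in paths(MATCH):
    print(" / ".join(path))
```

The four printed paths correspond to the four rows of Table 8, and the bound variables are the ones referenced in the CONSTRUCT clause of Table 7 plus the automatically named `$auto1`.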



Figure 2 represents a JoinIn graph (JIG) for the match expression of Table
8. The JoinIn graph is a data structure that facilitates the optimization of the query
to be executed against the data store. This JIG indicates that the DEPARTMENTS,
EMPLOYEES, and BUILDINGSDOCS tables of the data store are to be joined together.
This JIG also indicates the bindings (e.g., $first) and the join columns (e.g., Name
and Dept). The format of the JIG is described below in detail. The JIG is
generated from the match expression using the mapping. The data integration
engine then generates the query to be executed. The following query is generated.



SELECT EMPLOYEES.Fname, EMPLOYEES.Lname, BUILDINGSDOCS.Phone
FROM DEPARTMENTS, EMPLOYEES, BUILDINGSDOCS
WHERE DEPARTMENTS.Name = EMPLOYEES.Dept AND
EMPLOYEES.Bldg = BUILDINGSDOCS.Building AND
EMPLOYEES.Office = BUILDINGSDOCS.Office AND
DEPARTMENTS.Name = "Engineering"
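
To see what this query returns, it can be run against a small in-memory copy of the example tables. The sketch below uses Python's standard `sqlite3` module purely for illustration; the choice of SQLite, the TEXT column types, and the reduced set of rows (a subset of Tables 1-3) are assumptions and not part of the described system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENTS (Name TEXT, Contact TEXT);
CREATE TABLE EMPLOYEES (ID TEXT, Fname TEXT, Lname TEXT, Dept TEXT,
                        Bldg TEXT, Office TEXT, Manager TEXT);
CREATE TABLE BUILDINGSDOCS (Building TEXT, Office TEXT, Phone TEXT,
                            MaintContact TEXT);
""")

# A subset of the rows of Tables 1-3; None stands for NULL.
conn.executemany("INSERT INTO DEPARTMENTS VALUES (?, ?)",
                 [("Finance", "E1247"), ("Engineering", "E3214")])
conn.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("E0764", "Bobby", "Darrows", "Finance", "B", "102", "E1247"),
    ("E3214", "David", "McKinzie", "Engineering", "L", None, "E1153"),
    ("E0868", "Misha", "Niev", "Engineering", "L", "15", "E1153"),
])
conn.executemany("INSERT INTO BUILDINGSDOCS VALUES (?, ?, ?, ?)", [
    ("B", "102", "x1102", "E0764"),
    ("L", "lobby", "x0001", "E3214"),
    ("L", "15", "x0150", "E3214"),
])

QUERY = """
SELECT EMPLOYEES.Fname, EMPLOYEES.Lname, BUILDINGSDOCS.Phone
FROM DEPARTMENTS, EMPLOYEES, BUILDINGSDOCS
WHERE DEPARTMENTS.Name = EMPLOYEES.Dept AND
      EMPLOYEES.Bldg = BUILDINGSDOCS.Building AND
      EMPLOYEES.Office = BUILDINGSDOCS.Office AND
      DEPARTMENTS.Name = 'Engineering'
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

With this subset of the data, only Misha Niev is returned. David McKinzie is in the engineering department but has a NULL office, so the join on Office finds no matching BUILDINGSDOCS row for him.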

Figure 3 is a block diagram illustrating the overall organization of an
execution program generated by the data integration engine. An execution
program consists of an extract program 310 and a construct program 320. A
compiler of the data integration engine generates the execution program during a
compilation phase. The extract program is a series of operations on data
extracted from the data sources. The extract program represents a graph of the
operations. The leaf nodes 311 of the extract program represent a sorted outer
union operation applied to the data stores 312. The compiler generates a query for
each data store in the native query language of the data store to retrieve the results
of the sorted outer union. The compiler generates the sorted outer union using the
LMatch operation, JoinIn graph, and mapping. During execution of the extract
program, the generated query is applied to each data store. The construct program
accesses the root node 313 of the extract program, which retrieves the results
generated by the extract program. The construct program collects the data and
formats it into an XML output. As discussed below in more detail, the output of
each operation of the extract program is in a nested conditional relation format.

Figures 4-9 are flow diagrams illustrating processing of the compiler of the
data integration engine in one embodiment. Figure 4 is a block diagram
illustrating a function to generate an execution program. The function first
generates the extract program and then generates the construct program. In block
401, the function invokes a generate extract program function to generate an
extract program for the specified query against the specified data stores. In block
402, the function invokes the generate construct program function to generate a
construct program to generate the results from the extracted data.

Figure 5 is a flow diagram illustrating processing of the generate extract
program function in one embodiment. In block 501, the function generates an
extract plan. In block 502, the function identifies fragments of the extract plan. A
fragment of an extract plan is the set of operations that are applied to data
derived from a single data source. Operations that apply to data from multiple
data sources are grouped into one fragment. In block 503, the function optimizes
the operations of the fragments and then returns.

Figure 6 is a flow diagram illustrating the processing of the generate extract
plan function in one embodiment. In block 601, the function receives the XML
query. In block 602, the function generates a match expression for the logical
match associated with the data store. In block 603, the function creates the JoinIn
graph from the match expression using the mapping for the data store. In block
604, the function generates the native query from the JoinIn graph. The function
indicates additional processing to generate the extract plan from the JoinIn graph.
Blocks 602-604 illustrate the generation of the native query for the sorted outer
union of the leaf nodes of the extract plan. The ellipses indicate other processing
performed by the function. The function then returns.

Figure 7 is a flow diagram illustrating processing of the match expression
function in one embodiment. This function is passed an XML node representing
the data store and returns the match expression. This function is recursively
invoked for each child node of the passed XML node. In block 701, the function
initializes the sub-tree to the XML node. In blocks 702-705, the function loops
creating a match expression for each child node. In block 702, the function
selects the next child node of the XML node. In decision block 703, if all the
child nodes have already been selected, then the function returns, else the function
continues at block 704. In block 704, the function recursively invokes the create
match expression function passing the child node and receiving a child sub-tree in
return. In block 705, the function adds the child sub-tree to the sub-tree and then
loops to block 702 to select the next child.

Figure 8 is a flow diagram illustrating the processing of the create JoinIn
graph function in one embodiment. In block 801, the function invokes the
generate JoinIn graph function passing the match expression and receiving the
JoinIn graph in return. In block 802, the function merges nodes of the JoinIn graph. In block
803, the function processes merging of adjoining nodes of the JoinIn graph and
then returns.

Figure 9 is a flow diagram illustrating processing of the generate JoinIn
graph function in one embodiment. This function is passed a match expression
and returns a JoinIn graph. The function is recursively invoked for each child
node of the passed match expression. In block 901, the function sets the JoinIn
graph to a node corresponding to the root of the match expression. The function
retrieves the mapping rows that can further the path from the root. In block 902,
the function selects the next child node of the match expression. In decision block
903, if all the children have already been selected, the function returns, else the
function continues at block 904. In block 904, the function recursively invokes
the generate JoinIn graph function passing the selected child node of the match
expression and receiving a child JoinIn graph in return. In block 905, the function
adds the child JoinIn graph to the JoinIn graph and then loops to block 902 to
select the next child node of the match expression.

Figures 10-15 illustrate the generation of an SQL query for a sorted outer
union node of an extract program. Figure 10 illustrates the tables of the data store.
The arrows between the tables illustrate joins between tables. For example, arrow
1001 represents a join between the third column of table 1.1 and the first column
of table 2.1. Figure 11 illustrates the results of the sorted outer union for the tables
of Figure 10. Figure 12 illustrates the SQL query for each of the tables of Figure
10 that is used to generate the sorted outer union.

Figure 13 is a flow diagram illustrating the processing of a function to
generate a sorted outer union. In block 1301, the function selects the next table of
the source data store. In decision block 1302, if all the tables have already been
selected, the function continues at block 1304, else the function continues at block
1303. In block 1303, the function invokes the generate SQL query function for the
selected table and then loops to block 1301 to select the next table. In block 1304, the
function executes the generated SQL queries against the tables. In block 1305, the
function aggregates the results of the queries into a table. In block 1306, the
function sorts the results and then returns.
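
The aggregate-and-sort steps of blocks 1305 and 1306 can be illustrated with in-memory rows. This is a sketch under stated assumptions rather than the described implementation: it takes a sorted outer union to mean that rows from every table are padded with nulls for the columns they lack and then sorted on a shared key so that each parent row is followed by its child rows. The function name `sorted_outer_union`, the lowercase column names, and the sample rows are hypothetical.

```python
def sorted_outer_union(tables, key):
    """tables: ordered list of (name, rows); every row dict carries the shared key.

    Returns (columns, rows): rows from all tables, padded with None for the
    columns their table lacks, sorted by key and then by table order.
    """
    # The output schema is the union of all columns, key first.
    columns = [key]
    for _, rows in tables:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)

    # Aggregate: pad every row to the full schema (block 1305).
    union = []
    for order, (name, rows) in enumerate(tables):
        for row in rows:
            padded = tuple(row.get(col) for col in columns)
            union.append((row[key], order, name, padded))

    # Sort: group by key, parent table before child table (block 1306).
    # list.sort is stable, so rows of one table keep their input order.
    union.sort(key=lambda item: (item[0], item[1]))
    return columns, [(name, padded) for _, _, name, padded in union]


departments = [{"dept": "Finance", "contact": "E1247"},
               {"dept": "Engineering", "contact": "E3214"}]
employees = [{"dept": "Finance", "id": "E0764"},
             {"dept": "Engineering", "id": "E0868"},
             {"dept": "Finance", "id": "E0334"}]

columns, rows = sorted_outer_union(
    [("DEPARTMENTS", departments), ("EMPLOYEES", employees)], key="dept")
for name, row in rows:
    print(name, row)
```

Because each department row is immediately followed by the employee rows that share its key, a consumer can rebuild the nesting in a single pass over the sorted result.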

Figure 14 is a flow diagram illustrating processing of a generate SQL query
function in one embodiment. In block 1401, the function outputs a select, from,
and where clause for the query. In blocks 1402-1408, the function loops selecting
each table in a join path of the data store. In block 1402, the function selects the
next table in the path. In decision block 1403, if all the tables have already been
selected, then the function returns, else the function continues at block 1404. In
block 1404, the function adds the table to the from clause. In block 1405, the
function adds the table to the where clause. In blocks 1406-1408, the function
loops selecting each column of the selected table. In block 1406, the function
selects the next column. In decision block 1407, if all the columns have already
been selected, then the function loops to block 1402 to select the next table of the
path, else the function updates the select clause with the column and then loops to
block 1406 to select the next column. Columns of tables not in the selected path
are set to null.
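
A rough sketch of such a generator follows. It is not the flow of Figure 14 step for step: the select list is built in a separate pass over all tables so that every generated query has the same columns in the same order, which is what lets the per-path results be combined. The function name `generate_sql_query`, its parameters, and the `TABLE_column` aliases for the null-padded columns are assumptions.

```python
def generate_sql_query(path, all_tables, joins):
    """Build the query for one join path of the sorted outer union.

    path:       tables on this join path, in join order.
    all_tables: ordered {table: [columns]} for every table in the union.
    joins:      {(left, right): "left.col = right.col"} join conditions.
    """
    select, from_, where = [], [], []

    # Add each table on the path to the from clause and, after the first,
    # its join condition to the where clause.
    for i, table in enumerate(path):
        from_.append(table)
        if i > 0:
            where.append(joins[(path[i - 1], table)])

    # Select real columns for tables on the path, NULL for all others.
    for table, cols in all_tables.items():
        for col in cols:
            if table in path:
                select.append(f"{table}.{col}")
            else:
                select.append(f"NULL AS {table}_{col}")

    sql = f"SELECT {', '.join(select)} FROM {', '.join(from_)}"
    if where:
        sql += f" WHERE {' AND '.join(where)}"
    return sql


ALL_TABLES = {"DEPARTMENTS": ["Name"], "EMPLOYEES": ["ID", "Dept"]}
JOINS = {("DEPARTMENTS", "EMPLOYEES"): "DEPARTMENTS.Name = EMPLOYEES.Dept"}

q1 = generate_sql_query(["DEPARTMENTS"], ALL_TABLES, JOINS)
q2 = generate_sql_query(["DEPARTMENTS", "EMPLOYEES"], ALL_TABLES, JOINS)
print(q1)
print(q2)
```

Both generated queries return three columns, so their results can be placed in one table and sorted on the first column.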

Figure 15 is a block diagram illustrating an extract program. Each of the
leaf nodes 1501-1505 represents an SQL query that is applied to a data source.
Node 1506 represents a nesting of the results of nodes 1501 and 1502. Node 1507
represents a nesting of the results of nodes 1506 and 1503. Node 1508 represents
a selection on the results of node 1507. Node 1509 represents a nesting of the
results of nodes 1504 and 1505. Node 1510 represents a join of the results of
nodes 1508 and 1509. Node 1511 represents a projection of the results of node
1510. Node 1512 represents the construct program that accesses the extract
program.

Figure 16 is a flow diagram that illustrates code of a join node of an extract
program in one embodiment. In one embodiment, the processing of each node of the
extract program is performed in a pipelined manner; that is, each node returns only the
data needed to satisfy the next request from the construct program. In decision
block 1601, if the right node is fully processed, then the function continues at
block 1602, else the function continues at block 1605. In decision block 1602, if
the left node of the join is fully processed, then the function returns, else the
function continues at block 1603. In block 1603, the function retrieves the next
results from the left node. In block 1604, the function initializes the right node
based on the results returned from the left node. In block 1605, the function
retrieves the next results from the right node. In decision block 1606, if the results
returned from the right node are contained in a nested table, then the function
returns an iterator for that table, else the function returns the results. The iterator
for a table is an optimization that allows nodes higher in the extract program to
retrieve subsequent rows of the nested table without having to invoke lower-level
nodes in the extract program.
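
The pipelined behavior of the join node can be sketched with a Python generator, under the assumption that each node is exposed as an iterable of row tuples. The names join_node and init_right are hypothetical, and the nested-table iterator optimization of block 1606 is not modeled.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple


def join_node(left: Iterable[Row],
              init_right: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Yield joined rows one at a time, pulling input only on demand."""
    for left_row in left:                       # blocks 1602-1603
        # Block 1604: the right node is (re)initialized from the left row.
        for right_row in init_right(left_row):  # blocks 1601 and 1605
            yield left_row + right_row
```

Because the function is a generator, a caller that asks for one row causes only as many rows to be retrieved from the left and right nodes as are needed to produce it.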

Figure 17 illustrates the output of the nodes of the extraction plan of Figure
15. When the construct program 1712 invokes root node 1711 of the extract
program, that invocation is propagated down to the leaf nodes. The SQL query of
node 1701 returns result 1713, and the SQL query of node 1702 returns result
1714. Node 1706 indicates to nest the results of nodes 1701 and 1702. In this case,
result 1714 is nested within result 1713 as indicated by result 1715. Node 1703
returns result 1716. Node 1707 nests result 1716 within result 1715. The
subscript within node 1707 specifies a target for the nesting. In this case, the
subscript 2 indicates to nest within the third column of result 1715. (Columns are
identified starting with column 0.) Result 1717 represents the result of the nesting.
Node 1708 represents a selection on result 1717. The target represented by
subscript 2.1 indicates to select the third column and the first row within the third
column. The result of the selection is result 1718. Results 1719-1723 illustrate
the results of the other nodes of the extract program.

LMatch Operation 

The LMatch operator performs navigation-based selection over XML
input. The following example illustrates an XML-QL syntax fragment and the
LMatch instance that is created to model it inside the compiler:

    <a><b><c>$c</></></> ELEMENT_AS $a

    LMatch($results, "self(a,$a)—child(b,--)—child(c,$c)")

The "self(a,$a)—child(b,--)—child(c,$c)" is a match expression. In this
example, the match expression is a tree with three nodes. The general structure of
the XML-QL pattern is translated into an isomorphic pattern within the match
expression. The XML-QL variables become "bindings" within the navigations.
The LMatch operator is one of the logical operators of the internal language of the
data integration engine. The LMatch operator is generally the "first" operator that
is applied to input data and is responsible for converting XML input into NCRs
that are then further processed by the query engine. The LMatch operator is a
logical operator only in that one of the actions of the Compiler is to convert
LMatch operators into a data source-dependent form (e.g., SQL for relational
databases, or QLL for QL-Lite data sources).

The LMatch operator defines a match against XML data. The pattern is
defined by the "match expression," which is a tree of navigation steps. Each
navigation step describes a "movement" from a source element or attribute to a
target element or attribute. The parameters of the navigation step that govern
navigation are as follows:

• The type of movement or navigation (child, parent, descendant, etc.). The
navigation types are based on XPath axes.

• The name of the target element or attribute, which may be a wild card.

• Whether the target should be an element, an attribute, or either.

• Whether the navigation is optional or not.

LMatch matching is top-down on the tree of navigation steps. That is, the
match begins at the root of both the XML document and the match
expression. Matches for the first navigation step are sought in the entire XML
document. If the first navigation is a root navigation, then it matches the root of
the XML document (where we interpret root to be the root element, not the
document item, as defined in DOM). If the first navigation step is something other
than root, it is treated as a navigation from the root.

Once a node or set of nodes has been identified for the first navigation, the
algorithm proceeds recursively: given a matched node, attempt each of the child
navigations from the navigation tree (where child here means "child in the
navigation tree," rather than child type node). Each attempted navigation will
itself yield a new set of zero or more matches, which are then continued in the
next level of the recursion, and so forth. While the recursion proceeds down the
navigation tree, the navigations do not necessarily proceed "down" the XML tree;
navigation types can move in arbitrary directions within the XML document (e.g.,
ancestor or preceding-sibling).
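
The recursion can be sketched in Python using the standard xml.etree.ElementTree module. This is a minimal sketch that models only the child navigation type; the Step class and the match_node function are hypothetical names, and match_node assumes that the given node has already been matched by the given step. A required navigation that finds no match causes its source node to be reported as unmatched (None), while an optional navigation simply contributes an empty list.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """Hypothetical navigation step; only the child axis is modeled."""
    name: str
    optional: bool = False
    children: List["Step"] = field(default_factory=list)


def match_node(node: ET.Element, step: Step) -> Optional[dict]:
    """Match the navigations below `step`, starting from `node`.

    Returns a dict of matches keyed by child step name, or None when a
    required child navigation fails and `node` is therefore unmatched.
    """
    matches = {}
    for child_step in step.children:
        found = []
        for candidate in node.findall(child_step.name):  # direct children only
            sub = match_node(candidate, child_step)
            if sub is not None:
                found.append({"element": candidate, "matches": sub})
        if not found and not child_step.optional:
            return None  # the failure propagates upward to the caller
        matches[child_step.name] = found
    return matches
```

For a document such as <a><b><c>x</c></b><b/></a>, a required child(c) step keeps only the first <b> element, whereas an optional child(c) step keeps both.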

If an attempted navigation yields zero matches from some source node,
then that navigation is said to have failed. If the navigation was not marked as
optional=true, then the failure of the navigation causes the source node to be
"unmatched." The following match expression illustrates the failure of a
navigation:

    self(a,$a)—child(b,$b)—child(c,$c)

The first navigation step may yield a single element <a>. The second step
may yield a set of <b> elements, some of which contain <c> elements and some of
which do not. When the final navigation is evaluated, it will, for some <b>
elements, yield no results. If the navigation is optional (optional=true), then all
the <b> elements are included in the result. If, however, the navigation is required
(optional=false), then those <b> elements that contain no <c> elements are
removed from the set of matches for child(b) from the root <a> element. The
result contains only <b> elements that actually contain <c> elements. If no <b>
elements remain after this process, then the failure propagates upward,
"unmatching" the <a> elements (unless the child(b) navigation was optional).

The evaluation of an LMatch operator is a three-stage process: first, match
the pattern within the LMatch operator against some source of XML; second,
connect columns in the LMatch pattern with their associated items in the
information set of the XML source; and third, structure those connected columns
(the extracted information) into an NCR as indicated by the nesting settings on
individual navigations. That is, an LMatch operator specifies a structural pattern
that is sought after in a document, specifies which parts of that pattern should be
returned, and specifies how the returned parts should be organized. The output of
an LMatch operator is an NCR that contains the returned parts, organized as
specified.

The parameters of the LMatch that govern how results are constructed are
these:



• The set of columns returned. An NCR column "names" some piece of
information returned from an element or attribute node that has been
matched. There are several kinds of columns:

    o value: the contents of a simple element or attribute

    o subtree: the entire element (not applicable to attributes)

    o name: the name of the element or attribute

    o text: the text value of an element (used to extract text from mixed-content
      elements; in a simple element it is equivalent to 'value')

    o table: the table column gives a name to the entire set of results when
      nested=true

• Whether or not the results of the navigation should be nested.

Each navigation step may have one or more of the column types present.
The type of the column is derived from the type of the corresponding contents of
the XML document (except for the table column).

These columns are structured into an NCR based on the nested flag and the table
column: If nested=true, then the table column was specified, and the navigation
creates a nested NCR. This NCR contains all the other columns for this
navigation step, as well as all the columns generated by the subtree of navigations
beneath it. For example:

    self(a,--) —N child(b,$b,$btable) — child(c,$c)
               — child(d,$d)

The root (top-level) navigation may also be nested or unnested. In addition,
the LMatch operator, like other operators, provides an additional column that
names its entire schema. The child(b) navigation is a nested navigation that
results in a nested NCR, named $btable, in the result. This NCR will contain
columns $b (because $b is a column on the child(b) navigation) and $c (because
$c is a column on a navigation in child(b)'s subtree). Figure 18 illustrates a final
NCR structure.

A depth-first traversal of the match expression of an LMatch operator is
used to construct the columns of the output NCR. As a result, the LMatch
operator also defines an ordering of the columns as well as their structure and
names.

When a navigation matches multiple times, then the results differ based on
whether the navigation is nested. If the navigation is a nested navigation, then a
nested NCR is created, which will contain the matches. But if the navigation is
not nested, then the results are combined via a cross-product with all the other
columns in the same table. So, if one <b> element contained multiple <c>
elements, the $btable would contain the corresponding <b>-<c> pairs.
Navigations that are not nested can be treated as a special case of nested
navigations. Thus, an LMatch operator can be evaluated as if all navigations are
nested. Then, for each navigation that is not actually nested, an LFlatten operation
can be used to remove the table corresponding to the nesting.

A subtree column results in the entire XML subtree, tags and all, being
returned as an atomic value. (This corresponds to the ELEMENT_AS notation in
XML-QL.) The compiler transforms this column into a more complex LMatch
expression that "pulls apart" the entire subtree contents and modifies the rest of the
execution unit to reconstruct the result back into a subtree when needed. As a
result, subtree columns exist initially, but they are replaced with more complex
patterns. Before they are rewritten, the subtree columns are modeled in the NCR
schema as a single, static column. After the rewrite, they begin with a
table-valued column containing the nested results.

Advantages of the LMatch operator being a single, complex operation
include:

1. When queries are generated for query languages which themselves contain
some form of matching operations, then mapping onto those operations is
enabled.

2. Certain optimizations that may be done on navigational matching are better
enabled by capturing succinctly the navigation that is being done. Examples
include reasoning about substitution of a descendant relation with a
union of paths, and vice versa, and reasoning about document order
relations.

3. The LMatch operator combines two kinds of capabilities into a single
operator: navigational operations and composition of the results into a
complex structure (the NCR). This allows a concise representation of a
very common idiom.

The LMatch operator can be matched against a tree that represents an XML 
generator, rather than the actual XML document. For example, 

• The XML RDB Map can be interpreted as a generator of an XML 
document from a relational database. Matching the LMatch operator 
against an XML RDB Map is a fundamental step in converting the XML 
query into SQL. 

• The Construct Program of a query can be interpreted as a generator of an 
XML document from an NCR. Matching the LMatch operator against a 
Construct Program is a fundamental step in composing views. 

The algorithm for matching against tree-structured XML generators is very 
similar to the algorithm for matching against XML input directly. One difference 
is that where matching against an XML document generates tuples of output, 
matching against a generator generally produces a Correspondence Tree, which 
encodes all the potential correspondence points between the nodes of the generator 
and the navigation steps of the LMatch. 

An XML generator is a tree (actually, a forest suffices) where the nodes in
the tree represent the generation of XML elements or attributes or their values, and
arcs between nodes represent inclusion. For example:

    element("person") — attribute("ssn") — value()
                      — element("name") — value()
                      — element("address") — value()

The XML generator also indicates the arity of each arc. The values for
arity are optional (0 or 1), singular (exactly one) and multiple (0 or more). If an
arc is marked multiple, then the generator can generate more than one instance of
the child node for each parent instance. In the above example, if the arc between
"person" and "name" were marked multiple, then a person could have zero or
more names. The arity of an arc is indicated by a subscript on the arc as shown in
the following:

    element("person") —S attribute("ssn") — value()
                      —M element("name") — value()
                      —O element("address") — value()

When no arity is indicated, singular is assumed. If it is not possible to
derive arity information from the generator, then multiple is assumed, since it is
the most general case.

The Correspondence Tree tracks which navigation steps in the LMatch
operator correspond with which nodes in the XML generator. The
Correspondence Tree would be isomorphic to the LMatch navigation graph except
for one thing: any given navigation step might match against multiple nodes in the
generator. The following is an example of an XML generator, an LMatch
operator, and the corresponding Correspondence Tree:

The XML generator:

    element("person")1 — attribute("ssn")2 — value()3
                       — element("name")4 — value()5
                       — element("name")6 — value()7

The LMatch:

    self(person)1 — child(ssn)2
                  — child(name)3

Figure 19 illustrates the Correspondence Tree.

The subscripts on nodes in the generator and LMatch distinguish otherwise
identical nodes when they appear in the Correspondence Tree. The
Correspondence Tree is "read" as: "The root navigation has a single match,
namely the element("person")1 node of the XML generator. From this generator
node, the next LMatch navigation, child(name)3, is matched against two different
generator nodes, and so on."

The Correspondence Tree is a bipartite graph. A bipartite graph is one in
which nodes come in two different alternating types. In this case, the node types
are called navigation nodes (which reference navigation steps, and are pictorially
indicated with brackets [ ]) and choice nodes (which reference generator nodes,
and are pictorially indicated with braces { }). A bipartite graph is interpreted as
having two different kinds of arcs, which are indicated by lines of different
weights: light lines are choice arcs (arcs from navigation to choice nodes,
choosing amongst multiple correspondences) and heavy lines are navigation arcs
(arcs from choice to navigation nodes, following the navigation relationships in
the LMatch operator).

A correspondence is a (navigation step, generator node) pair of a
correspondence tree. A correspondence is derived from a choice node by including
the navigation step from the parent. For example, the following subgraph of a
correspondence tree yields the following correspondence:

    subtree:         [ child(name)3 ] — { S: element("name")4 }
    correspondence:  { child(name)3, element("name")4 }

The following matching algorithm generates the Correspondence Tree,
given an LMatch operator and an XML generator as input. The algorithm is a
top-down recursion over the LMatch navigation graph.

The XML generator has the following operations:

    XMLGenerator.root() -> ordered list of GeneratorNode
    GeneratorNode.type() -> { "element" | "attribute" | "value" }
    GeneratorNode.name() -> GName
    GeneratorNode.genChildren() -> list of GeneratorNode
    GeneratorNode.arity(childNode) -> { "S" | "M" | "O" }

In this example, the LMatch operator is limited to the following navigation
types: root, child, self. The nested flag on LMatch navigation steps is irrelevant to
matching. The LMatch operator provides the following operations for accessing
the match expression:

    LMatch.root() -> NavStep
    NavStep.type() -> { "root" | "child" | "self" }
    NavStep.ea() -> { "element" | "attribute" | "either" }
    NavStep.name() -> NName
    NavStep.navChildren() -> list of NavStep
    NavStep.optional() -> boolean

There is also a function, nameMatch(GName, NName) -> boolean, that
returns true or false according to whether the name from a generator node matches
the name of an LMatch navigation. The Correspondence Tree provides the following operations:



    CorrespondenceTree.root() -> NavigationNode
    CorrespondenceTree.createRoot( NavStep )

    NavigationNode.new( NavStep )
    NavigationNode.navStep() -> NavStep
    // Models navStep().type()
    NavigationNode.type() -> { "root" | "child" | "self" }
    // And navStep().name()
    NavigationNode.name() -> NName
    NavigationNode.choiceChildren() -> list of ChoiceNode
    NavigationNode.addChoiceChild( ChoiceNode )

    ChoiceNode.new( GeneratorNode, arity )
    ChoiceNode.generatorNode() -> GeneratorNode
    ChoiceNode.type() -> { "element" | "attribute" | "value" }   // ditto
    ChoiceNode.name() -> GName                                    // ditto
    ChoiceNode.arity() -> { "S" | "M" | "O" }                     // ditto
    ChoiceNode.navChildren() -> list of NavigationNode
    ChoiceNode.addNavChild( NavigationNode )

The following illustrates the BuildCorrespondence function that is invoked
to build a Correspondence Tree for an XML generator and an LMatch operator:

    // Assume a rooted LMatch match expression;
    // normalize the LMatch to make this true if necessary.
    BuildCorrespondence( XMLGenerator g, LMatch lm )
    {
        // create the correspondence tree
        ct <- new CorrespondenceTree
        nn <- new NavigationNode( lm.root() )

        // bootstrap the first level of expansion,
        // matching root against roots
        ct.addRoot( nn )
        foreach ( gn in g.root() ) {
            cn <- new ChoiceNode( gn, "M" )
            if ( addNavs( nn, cn ) )
                nn->addChoiceChild( cn )
        }
        return ct
    }



The form of the algorithm is mutual recursion between two functions, each
of which extends the graph by one level, or fails to do so (because there is no
match). The subroutines return boolean values indicating whether or not they
were successful; this value is then used to determine whether or not to continue
and whether or not to actually add nodes to the graph. The following is the pseudo
code for the addNavs function:



    // From a given corresponding navigation and choice node pair,
    // extend the choice node for each child navigation of the
    // navstep.
    boolean addNavs( NavNode nn, ChoiceNode cn )
    {
        foreach ( step in nn.navStep().navChildren() ) {
            stepnavnode <- new NavigationNode( step )
            success <- addChoices( cn, stepnavnode )
            // if a navigation is optional, we include the navNode,
            // even if it failed (the navNode will have no choice children)
            if ( success || step.optional() )
                cn->addNavChild( stepnavnode )
            else // failure of a required navigation; abort
                return false
        }
        // if no required navigation failed, return true
        return true
    }



The following is the pseudo code for the addChoices function:



    // Given a location in the generator and a requested
    // navigation, "follow" the navigation in the generator tree,
    // finding a new layer of correspondences.
    boolean addChoices( ChoiceNode cn, NavNode nn )
    {
        success <- false
        foreach ( gn in follow( cn.generatorNode(), nn.navStep() ) ) {
            choicenode <- new ChoiceNode( gn, cn.generatorNode().arity(gn) )
            thissuccess <- addNavs( nn, choicenode )
            if ( thissuccess ) {
                nn->addChoiceChild( choicenode )
                success <- true
            }
        }
        // return true if at least one choice worked out
        return success;
    }


The following is the pseudo code for the follow function: 



    // Implement the actual navigation; this would be extended
    // with more types of navigation as the LMatch is extended
    // (and would probably require the generator to support more
    // powerful navigations as well, at least parent()).
    List<GeneratorNode> follow( GeneratorNode gn, NavStep nav )
    {
        List<GeneratorNode> result <- ();
        switch ( nav.type() ) {
        case "self":
            if ( nameMatch( gn.name(), nav.name() ) )
                result.add( gn )
            break
        case "child":
            // test each child of gn, not gn itself
            foreach ( gnkid in gn.genChildren() )
                if ( (nav.ea() == "element" || nav.ea() == "either")
                     && gnkid.type() == "element"
                     && nameMatch( gnkid.name(), nav.name() ) )
                    result.add( gnkid )
                else if ( (nav.ea() == "attribute" || nav.ea() == "either")
                     && gnkid.type() == "attribute"
                     && nameMatch( gnkid.name(), nav.name() ) )
                    result.add( gnkid )
            break
        case "root":
            foreach ( r in xmlGenerator.root() ) // [1]
                if ( nameMatch( r.name(), nav.name() ) )
                    result.add( r )
            break
        }
        return result
    }
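
The pseudo code above can be exercised with a compact Python sketch such as the following. The sketch makes several simplifying assumptions that are not part of the pseudo code: the arity of an arc is stored on the child generator node rather than obtained from the parent, the list of generator roots is passed explicitly, nameMatch is reduced to equality with "*" as a wild card, and the root bootstrap also checks the name of each generator root.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenNode:
    kind: str                      # "element" | "attribute" | "value"
    name: str = ""
    arity: str = "S"               # arity of the arc from the parent (assumed)
    children: List["GenNode"] = field(default_factory=list)


@dataclass
class NavStep:
    kind: str                      # "root" | "child" | "self"
    name: str
    ea: str = "either"             # "element" | "attribute" | "either"
    optional: bool = False
    children: List["NavStep"] = field(default_factory=list)


@dataclass
class ChoiceNode:
    gen: GenNode
    arity: str
    navs: List["NavigationNode"] = field(default_factory=list)


@dataclass
class NavigationNode:
    step: NavStep
    choices: List[ChoiceNode] = field(default_factory=list)


def name_match(gname: str, nname: str) -> bool:
    return nname == "*" or gname == nname


def follow(gn: GenNode, step: NavStep, roots: List[GenNode]) -> List[GenNode]:
    if step.kind == "self":
        return [gn] if name_match(gn.name, step.name) else []
    if step.kind == "root":
        return [r for r in roots if name_match(r.name, step.name)]
    if step.kind == "child":
        wanted = ("element", "attribute") if step.ea == "either" else (step.ea,)
        return [kid for kid in gn.children
                if kid.kind in wanted and name_match(kid.name, step.name)]
    raise ValueError(f"unsupported navigation type: {step.kind}")


def add_navs(nn: NavigationNode, cn: ChoiceNode, roots: List[GenNode]) -> bool:
    for step in nn.step.children:
        child = NavigationNode(step)
        if add_choices(cn, child, roots) or step.optional:
            cn.navs.append(child)  # an optional step is kept even if empty
        else:
            return False           # failure of a required navigation
    return True


def add_choices(cn: ChoiceNode, nn: NavigationNode, roots: List[GenNode]) -> bool:
    success = False
    for gn in follow(cn.gen, nn.step, roots):
        choice = ChoiceNode(gn, gn.arity)
        if add_navs(nn, choice, roots):
            nn.choices.append(choice)
            success = True
    return success


def build_correspondence(roots: List[GenNode], lmatch_root: NavStep) -> NavigationNode:
    nn = NavigationNode(lmatch_root)
    for gn in roots:
        cn = ChoiceNode(gn, "M")
        if name_match(gn.name, lmatch_root.name) and add_navs(nn, cn, roots):
            nn.choices.append(cn)
    return nn
```

Applied to the person generator with one ssn attribute and two name elements, and to an LMatch of self(person) with child(ssn) and child(name), the sketch yields one choice for the root navigation, one choice for child(ssn), and two choices for child(name).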



The BuildCorrespondence algorithm presented above does not match
against actual XML data. However, an XML document may be considered a
degenerate XML generator with singular-arity arcs and constant value nodes, in
which case an NCR is built rather than a Correspondence Tree. The relationship between a
Correspondence Tree and an NCR is as follows:

• The values for any particular navigation are the concatenation of the values
for each choice below that navigation; the result is a list of data.

• For nested navigations, the rows of the nested table are that list of data.

• For an unnested navigation that is at most singular, the list of data can
contain only 0 or 1 rows. In this case, the NCR column is essentially a field
that is filled in by the value.

• In general for unnested navigations, the list of data is "joined against" the
existing rows of the containing table. If the navigation is optional, the join
is an outer join; if required, an inner join. If the list has multiple entries, the
effect is a cross product against the other contents of the table.

Because navigations can result in failure that propagates recursively
upwards, matches to the leaves are evaluated before committing to any results.
Alternatively, the LMatch operation could contain only optional navigations or
only required navigations in cases where the data will be present. Similarly, it is
possible to eliminate the need to handle joins or cross products by limiting the
LMatch operator to only allow unnested navigations when the data is at most
singular.

Two types of normalization that can be performed on LMatch operators are
removal of (non-root) self navigations and removal of implicit cross-products.
The normalized LMatch operator would consist only of a single root self
navigation and following child navigations, where for each child navigation,
nested=true. Alternatively, the normalization could cover either (nested=true) or
(nested=false and optional=false and the child is known to exist in a strict 1:1
relationship with the parent). Additional normalizations, such as requiring
optional=true on all nested child steps, may also be possible.

To normalize the LMatch operator, additional operators are inserted into the
Logical Extract Program to compensate for the changes to the LMatch operator.
These logical operators include the LSelect, LFlatten, and LBox operators. The
LSelect operator removes tuples from a table based on some condition. The
LFlatten operator flattens a nested table within an NCR. The operator is applied to
a single nested table, and the process of flattening removes that table. The
LFlatten operator has a boolean parameter "outer" indicating whether the
flattening operation should behave like an inner or left outer join; that is, whether
flattening removes the containing row when the nested table is empty. The
LBox operator serves to introduce an artificial level of nesting within a table.
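
The effect of the "outer" parameter can be sketched in Python, assuming that a table is a list of dictionaries and that a nested table is a list of dictionaries stored under its table column. The function name lflatten and the nested_columns parameter, which lists the columns to fill with nulls for an empty nested table, are assumptions of this sketch.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def lflatten(rows: List[Row], table_column: str,
             nested_columns: List[str], outer: bool = False) -> List[Row]:
    """Flatten the nested table stored under `table_column` in each row."""
    result: List[Row] = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != table_column}
        nested = row[table_column]
        if nested:
            # One output row per nested row: a cross product with the
            # other columns of the containing row.
            for inner in nested:
                result.append({**rest, **inner})
        elif outer:
            # Outer flattening keeps the containing row, padded with nulls.
            result.append({**rest, **{c: None for c in nested_columns}})
        # Inner flattening drops a row whose nested table is empty.
    return result
```

With outer set to false, a row whose nested table is empty disappears from the result, which corresponds to a required navigation; with outer set to true the row is kept, which corresponds to an optional navigation.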

A singular relationship between a child navigation and its parent navigation
is identified by examining the XML schema of the data that the LMatch operates
against. Initially, the matching algorithm is run. After that, it can be
determined, for each navigation step, which place(s) in the schema the LMatch
operator could match. From that information, and from the cardinality
information available in the schema, it can be identified whether the singular
condition holds.

The first version of the algorithm generates an LMatch that contains a
single, top-level self navigation and otherwise contains only child navigations. All
navigations (including the self at the top) have nested=true. The resulting
navigations may have optional=true or optional=false. The implementation can
be styled in a bottom-up or top-down traversal, but note that in either case
compensating operators are to be inserted at both the bottom and top of the chain.
The table below illustrates the various cases that can arise. The right-hand
column has examples of the transformations. Here is a sample XML document
this can be tested against:

<a>a1<b>b1</b>
    <c>c1<d>d1</d></c></a>
<a>a2<b>b1</b><b>b2</b>
    <c>c1<d>d1</d></c>
    <c>c2<d>d2</d></c></a>
<a>a3
    <c>c1<d>d1</d></c>
    <c>c2<d>d2</d><d>d3</d></c></a>
<a>a4<b>b1</b></a>

Figures 20-25 illustrate normalization.

Case: If the step is the root self step of the navigation tree, and has
nested=false, treat it the same way as a child step with nested=false
(see below).
Example: Figure 20

Case: If the step is a child step with nested=true, do nothing.

Case: If the step is a child step with nested=false, set nested=true,
autogenerate a table column for the new nested table, and add an
LFlatten operator to compensate. The LFlatten target is set to the
nested table that this step generates.
If the child step has optional=true, the LFlatten must be an "outer"
LFlatten.
Example: Figure 20, Figure 21

Case: If the step is a self step with optional=false and nested=false,
remove the step, migrating the children of the self step to the parent.
If the self step had columns on anything, add those columns to the
parent step. If one of the migrated columns is a duplicate of an
existing column of the parent step, use renaming to remove one of
the columns from the entire LEP.
Example: Figure 22

Case: If the step is a self step with optional=true and nested=true, remove
the step, migrating the children of the self step to the parent.
If the self step had columns on anything other than its table column,
add those columns to the parent step, unless it would clash with an
existing column on the parent step. In that case, add an LDup
operator to make a new copy of the self step's column.
Determine the total set of columns that "belong to" the self step
(including the result of the LDup, if any), and insert an LBox
operator to nest those columns, giving the result the original table
name from the omitted self step. The LBox operator is inserted after the
LDup, if there is one, but before all other steps.
Example: Figure 23

Case: If the step is a self step with optional=false and nested=true,
proceed as in the case above. Then add an LSelect operator to test
for emptiness of the nested table created by the LBox. Unlike other
operator additions, this LSelect operator must be added to the end of
the chain of operators that have been added, so that it operates only
after any flattenings have been done at deeper levels of nesting.
Example: Figure 24

Case: If the step is a self step with optional=true and nested=false,
proceed as in the case above, except instead of an LSelect step at the
end, insert an LFlatten with outer=true.





In one embodiment, the following optimization may be applied. If an LBox
is followed by the flattening of all its columns, the nested tables can be joined with
a sequence of LJoin operators (as cross products) instead. This optimization could
be performed either during this algorithm, or as a post-processing step. To
illustrate, the last example above could be rewritten as shown in Figure 25.

Alternatively, the normalization can be modified so that nested=true is added
only to child steps that can have multiple (or, possibly, optional)
values. This normalization may make it easier for inputs to create NCRs in which 1:1
elements are listed as flat columns of a row; any nesting on these columns may
need to be added by an explicit LBox operator. In the case of a child step that has
nested=false, the step is marked as singular without changing the value
of the nested flag and without adding an LFlatten operation. The other steps do
not change; in particular, elision of a self step in the general case may result in
adding an LBox, possibly followed by an LSelect or LFlatten operation. However,
if all child navigations of a self navigation are singular, then the LBox and
corresponding LFlatten can be omitted. The corresponding LSelect needs to be
changed to a test on the NULL-ness of the columns, rather than a test on the
emptiness of a nested table. This condition can be detected in a post-processing
step, but it would require information from both the LMatch (the singularity of
steps) and correlated information from the logical extraction program (the
presence of LBox and LFlatten/LSelect); thus, this optimization may be
implemented as an integral part of the recursive algorithm.

Nested Conditional Relations (NCR) Model and Algebra 

NCR extends relational algebra in two ways. First, it makes relations heterogeneous (i.e.,
allows them to contain records of different types). Each record is accompanied by a tag
describing its type, hence the term conditional relation. Second, relations can be nested.
The value of an attribute can be either atomic (e.g., int, float, string) or another NCR.
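
The two extensions can be pictured with a small Python data structure. This is only an illustrative sketch: the class names Record and NCR, the field names, and the sample values are hypothetical and are not defined by the model itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Record:
    tag: str                                           # describes the record type
    fields: Dict[str, Union[int, float, str, "NCR"]]   # atomic value or nested NCR


@dataclass
class NCR:
    records: List[Record] = field(default_factory=list)

    def tags(self) -> set:
        """Return the set of record types present in this relation."""
        return {record.tag for record in self.records}


# A heterogeneous relation: records with different tags in one relation.
# The "Staff" attribute of the first record is itself an NCR (nesting).
org = NCR([
    Record("Dept", {
        "Name": "Payroll",
        "Floor": 3,
        "Staff": NCR([Record("Person", {"Name": "Example Person"})]),
    }),
    Record("Person", {"Name": "Another Person"}),
])
```

Here org holds records tagged "Dept" and "Person" side by side, and the "Staff" attribute of the first record holds a nested NCR rather than an atomic value.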

1.1 Relational Algebra 

Traditional relational data models have tables that are homogeneous and flat, and selection and
projection operators select a subset of rows and fields in a table. A homogeneous table is one
that has rows of the same type. A flat table has atomic fields, that is, fields in the first normal
form. The following table is an example of a traditional relational data model.

Depts

    Name              ID   Phone  Floor
    Payroll           P    2345   3
    Payroll-Temps     P2   2244   1
    Payroll-NJ        P3   2345   4
    Engineering       E    7654   6
    ReEngineering     E2   2244   3
    Marketing-US      MU   1818   4
    Marketing-Europe  ME   9876   2



Each row in the table is a record of the same type: 

     [Name : string, ID : string, Phone : int, Floor : int]

Each field in the row is an atomic type. The table is a set of such rows. Its type is:

     Depts: {[Name : string, ID : string, Phone : int, Floor : int]}

The following selection operation (σ):

     σ_{Floor>5}(Depts)

results in a subset of the original table, consisting of the rows marked with * below:

     Name              ID   Phone  Floor
     Payroll           P    2345   3
     Payroll-Temps     P2   2244   1
     Payroll-NJ        P3   2345   4
  *  Engineering       E    7654   6
     ReEngineering     E2   2244   3
     Marketing-US      MU   1818   4
     Marketing-Europe  ME   9876   2


The following projection operation (Π):

     Π_{Name,Phone}(σ_{Floor>3}(Depts))

results in a subset of the original table, consisting of the Name and Phone fields of the rows marked with * below:

     Name              ID   Phone  Floor
     Payroll           P    2345   3
     Payroll-Temps     P2   2244   1
  *  Payroll-NJ        P3   2345   4
  *  Engineering       E    7654   6
     ReEngineering     E2   2244   3
  *  Marketing-US      MU   1818   4
     Marketing-Europe  ME   9876   2


The Nested Conditional Relation (NCR) model has tables that are heterogeneous and nested,
and has generalized versions of selection and projection that select a subset of the fields.
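
The flat relational case can be sketched in a few lines of Python before moving to the generalized operators. This is only an illustrative encoding, not part of the described system: the table is a list of dictionaries, and the helper names `select` and `project` are assumptions introduced here.

```python
# Depts as a homogeneous, flat table: a list of dicts that all share one record type.
DEPTS = [
    {"Name": "Payroll", "ID": "P", "Phone": 2345, "Floor": 3},
    {"Name": "Payroll-Temps", "ID": "P2", "Phone": 2244, "Floor": 1},
    {"Name": "Payroll-NJ", "ID": "P3", "Phone": 2345, "Floor": 4},
    {"Name": "Engineering", "ID": "E", "Phone": 7654, "Floor": 6},
    {"Name": "ReEngineering", "ID": "E2", "Phone": 2244, "Floor": 3},
    {"Name": "Marketing-US", "ID": "MU", "Phone": 1818, "Floor": 4},
    {"Name": "Marketing-Europe", "ID": "ME", "Phone": 9876, "Floor": 2},
]


def select(table, predicate):
    """Relational selection: keep the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def project(table, fields):
    """Relational projection: keep only the named fields of every row."""
    return [{f: row[f] for f in fields} for row in table]


# Projection of Name and Phone over the selection Floor > 3.
result = project(select(DEPTS, lambda r: r["Floor"] > 3), ["Name", "Phone"])
```

Both helpers return new lists and leave the input table untouched, which mirrors the algebraic reading of the operators as functions from tables to tables.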

1.2 Conditional Relations 

A heterogeneous collection, or conditional relation, is a relation that may have rows of
different types. The Depts and Persons tables below are of the traditional relational
model.

Depts

     Name              ID   Phone  Floor
     Payroll           P    2345   3
     Payroll-Temps     P2   2244   1
     Payroll-NJ        P3   2345   4
     Engineering       E    7654   6
     ReEngineering     E2   2244   3
     Marketing-US      MU   1818   4
     Marketing-Europe  ME   9876   2

Persons

     SSN        Name   Salary
     123456789  Smith  44444
     234567890  John   55555
     111111111  Sue    66666


A heterogeneous table consisting of departments and persons is obtained by interleaving the
rows of the Depts and Persons tables as shown below:

DepartmentsPersons

     Dept:  [Name: Payroll,          ID: P,  Phone: 2345, Floor: 3]
     Dept:  [Name: Payroll-Temps,    ID: P2, Phone: 2244, Floor: 1]
     Pers:  [SSN: 123456789, Name: Smith, Salary: 44444]
     Pers:  [SSN: 234567890, Name: John,  Salary: 55555]
     Dept:  [Name: Payroll-NJ,       ID: P3, Phone: 2345, Floor: 4]
     Dept:  [Name: Engineering,      ID: E,  Phone: 7654, Floor: 6]
     Pers:  [SSN: 111111111, Name: Sue,   Salary: 66666]
     Dept:  [Name: ReEngineering,    ID: E2, Phone: 2244, Floor: 3]
     Dept:  [Name: Marketing-US,     ID: MU, Phone: 1818, Floor: 4]
     Dept:  [Name: Marketing-Europe, ID: ME, Phone: 9876, Floor: 2]


The Departments rows have four fields and the Persons rows have only three fields. To
represent such a table, a tag is added to each row. The value of the tag can be either Dept or
Pers. Each row has a structure that depends on this tag. The type of such a row is called a
tagged union type, and is denoted as:

     <Dept: [Name : string, ID : string, Phone : int, Floor : int] |
      Pers: [SSN : int, Name : string, Salary : int]>

A value of this type is either a record of type [Name : string, ID : string, Phone : int, Floor : int]
preceded by the tag Dept, or a record of type [SSN : int, Name : string, Salary : int] preceded by the
tag Pers.

The type of the entire table DepartmentsPersons is a set of a tagged union type:

     DepartmentsPersons: {<Dept: [Name : string, ID : string, Phone : int, Floor : int] |
                           Pers: [SSN : int, Name : string, Salary : int]>}

The following selection operator selects all departments above the 3rd floor:

     σ_{<Dept/(Floor>3)>}(DepartmentsPersons)

The "Dept" tag in the condition (i.e., "Dept/(Floor>3)") indicates that the condition applies to rows with a Dept
tag. The "Floor>3" indicates that only those rows that have Floor>3 are selected. All rows that do not have the
Dept tag are selected intact. Thus, the type of the result is:

     {<Dept: [Name : string, ID : string, Phone : int, Floor : int] |
       Pers: [SSN : int, Name : string, Salary : int]>}

The result consists of the rows marked with * below:

     Dept:  [Name: Payroll,          ID: P,  Phone: 2345, Floor: 3]
     Dept:  [Name: Payroll-Temps,    ID: P2, Phone: 2244, Floor: 1]
  *  Pers:  [SSN: 123456789, Name: Smith, Salary: 44444]
  *  Pers:  [SSN: 234567890, Name: John,  Salary: 55555]
  *  Dept:  [Name: Payroll-NJ,       ID: P3, Phone: 2345, Floor: 4]
  *  Dept:  [Name: Engineering,      ID: E,  Phone: 7654, Floor: 6]
  *  Pers:  [SSN: 111111111, Name: Sue,   Salary: 66666]
     Dept:  [Name: ReEngineering,    ID: E2, Phone: 2244, Floor: 3]
  *  Dept:  [Name: Marketing-US,     ID: MU, Phone: 1818, Floor: 4]
     Dept:  [Name: Marketing-Europe, ID: ME, Phone: 9876, Floor: 2]


The following selection operation selects departments above the 3rd floor and people earning
more than 50000:

     σ_{<Dept/(Floor>3) | Pers/(Salary>50000)>}(DepartmentsPersons)

The condition applies to both Dept rows and Pers rows. For Dept rows, the condition
specifies Floor>3; for Pers rows, the condition specifies Salary>50000. The result consists
of the rows marked with * below:

DepartmentsPersons

     Dept:  [Name: Payroll,          ID: P,  Phone: 2345, Floor: 3]
     Dept:  [Name: Payroll-Temps,    ID: P2, Phone: 2244, Floor: 1]
     Pers:  [SSN: 123456789, Name: Smith, Salary: 44444]
  *  Pers:  [SSN: 234567890, Name: John,  Salary: 55555]
  *  Dept:  [Name: Payroll-NJ,       ID: P3, Phone: 2345, Floor: 4]
  *  Dept:  [Name: Engineering,      ID: E,  Phone: 7654, Floor: 6]
  *  Pers:  [SSN: 111111111, Name: Sue,   Salary: 66666]
     Dept:  [Name: ReEngineering,    ID: E2, Phone: 2244, Floor: 3]
  *  Dept:  [Name: Marketing-US,     ID: MU, Phone: 1818, Floor: 4]
     Dept:  [Name: Marketing-Europe, ID: ME, Phone: 9876, Floor: 2]


The same result can be achieved by applying the two selections in sequence, that is:

     σ_{<Dept/(Floor>3) | Pers/(Salary>50000)>}(DepartmentsPersons)
       = σ_{<Dept/(Floor>3)>}(σ_{<Pers/(Salary>50000)>}(DepartmentsPersons))
       = σ_{<Pers/(Salary>50000)>}(σ_{<Dept/(Floor>3)>}(DepartmentsPersons))
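
A minimal sketch of the tagged selection, assuming a conditional relation is encoded as a list of (tag, record) pairs and a selection condition as a dictionary from tag to predicate. The encoding and the name `select_tagged` are illustrative assumptions; only a few rows of DepartmentsPersons are reproduced.

```python
# A conditional relation as a list of (tag, record) pairs; only a few rows are shown.
DEPARTMENTS_PERSONS = [
    ("Dept", {"Name": "Payroll", "ID": "P", "Phone": 2345, "Floor": 3}),
    ("Pers", {"SSN": 123456789, "Name": "Smith", "Salary": 44444}),
    ("Pers", {"SSN": 234567890, "Name": "John", "Salary": 55555}),
    ("Dept", {"Name": "Engineering", "ID": "E", "Phone": 7654, "Floor": 6}),
]


def select_tagged(table, conditions):
    """Keep a row if its tag has no condition, or if the condition for its tag holds."""
    return [
        (tag, rec)
        for tag, rec in table
        if tag not in conditions or conditions[tag](rec)
    ]


floor_gt_3 = {"Dept": lambda r: r["Floor"] > 3}
salary_gt_50000 = {"Pers": lambda r: r["Salary"] > 50000}

# One combined selection, and the two single-tag selections applied in either order.
combined = select_tagged(DEPARTMENTS_PERSONS, {**floor_gt_3, **salary_gt_50000})
dept_then_pers = select_tagged(select_tagged(DEPARTMENTS_PERSONS, floor_gt_3), salary_gt_50000)
pers_then_dept = select_tagged(select_tagged(DEPARTMENTS_PERSONS, salary_gt_50000), floor_gt_3)
```

Because a condition on one tag never touches rows carrying another tag, the three results coincide, which is the equivalence stated above.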



The following projection operation projects out the Name and Phone fields for the Dept rows
and the Name field for the Pers rows:

     Π_{<Dept:[Name,Phone] | Pers:[Name]>}(σ_{<Dept/(Floor>3) | Pers/(Salary>50000)>}(DepartmentsPersons))

The type of the result is:

     {<Dept: [Name : string, Phone : int] | Pers: [Name : string]>}

The result consists of the Name and Phone fields of the Dept rows, and the Name field of the
Pers rows, marked with * below:

DepartmentsPersons

     Dept:  [Name: Payroll,          ID: P,  Phone: 2345, Floor: 3]
     Dept:  [Name: Payroll-Temps,    ID: P2, Phone: 2244, Floor: 1]
     Pers:  [SSN: 123456789, Name: Smith, Salary: 44444]
  *  Pers:  [SSN: 234567890, Name: John,  Salary: 55555]
  *  Dept:  [Name: Payroll-NJ,       ID: P3, Phone: 2345, Floor: 4]
  *  Dept:  [Name: Engineering,      ID: E,  Phone: 7654, Floor: 6]
  *  Pers:  [SSN: 111111111, Name: Sue,   Salary: 66666]
     Dept:  [Name: ReEngineering,    ID: E2, Phone: 2244, Floor: 3]
  *  Dept:  [Name: Marketing-US,     ID: MU, Phone: 1818, Floor: 4]
     Dept:  [Name: Marketing-Europe, ID: ME, Phone: 9876, Floor: 2]


If only a subset of the tags is mentioned in the projection operator, then all rows tagged
with the other tags are left unchanged in the result. Thus, the expression above is
equivalent to:

     Π_{<Dept:[Name,Phone] | Pers:[Name]>}(σ_{<Dept/(Floor>3) | Pers/(Salary>50000)>}(DepartmentsPersons))
       = Π_{<Dept:[Name,Phone]>}(Π_{<Pers:[Name]>}(σ_{<Dept/(Floor>3)>}(σ_{<Pers/(Salary>50000)>}(DepartmentsPersons))))
       = Π_{<Dept:[Name,Phone]>}(σ_{<Dept/(Floor>3)>}(Π_{<Pers:[Name]>}(σ_{<Pers/(Salary>50000)>}(DepartmentsPersons))))



1.3 Nested Conditional Relations

The following table illustrates NCRs. The Pers rows have an Assignments field that is a non-atomic
field. The field contains a nested conditional relation in that its sub-rows can be of type
Project or Committee.

DepartmentsPersons

     Dept:  [Name: Payroll,          ID: P,  Phone: 2345, Floor: 3]
     Dept:  [Name: Payroll-Temps,    ID: P2, Phone: 2244, Floor: 1]
     Pers:  [SSN: 123456789, Name: Smith, Assignments:
               Project:    [Name: Compiler,  Lang: C++]
               Project:    [Name: Optimizer, Lang: C++]
               Committee:  [Name: Awards]
               Project:    [Name: Wrapper,   Lang: Java]]
     Pers:  [SSN: 234567890, Name: John, Assignments:
               Project:    [Name: Compiler,  Lang: C++]
               Project:    [Name: Wrapper,   Lang: C++]
               Committee:  [Name: Awards]
               Committee:  [Name: Promotion]
               Committee:  [Name: Disciplinary]]
     Dept:  [Name: Payroll-NJ,       ID: P3, Phone: 2345, Floor: 4]
     Dept:  [Name: Engineering,      ID: E,  Phone: 7654, Floor: 6]
     Pers:  [SSN: 111111111, Name: Sue, Assignments:
               Project:    [Name: Compiler,  Lang: Java]
               Project:    [Name: Optimizer, Lang: C++]
               Committee:  [Name: Promotions]
               Project:    [Name: Wrapper,   Lang: Java]]
     Dept:  [Name: ReEngineering,    ID: E2, Phone: 2244, Floor: 3]
     Dept:  [Name: Marketing-US,     ID: MU, Phone: 1818, Floor: 4]
     Dept:  [Name: Marketing-Europe, ID: ME, Phone: 9876, Floor: 2]


The type of this table is:

     {<Dept: [Name : string, ID : string, Phone : int, Floor : int] |
       Pers: [SSN : int, Name : string,
              Assignments: {<Project: [Name : string, Lang : string] |
                             Committee: [Name : string]>}]>}

Rows tagged with Dept are flat records. The rows tagged with Pers have an Assignments
field whose value is another NCR, with its rows tagged either with Project or with
Committee. In this example, there is only one level of nesting; in general, however, there
may be arbitrary levels.

The following projection operation selects the Name and Phone fields of the Dept rows and
the Name field of the Pers rows.

     Π_{<Dept:[Name,Phone] | Pers:[Name]>}(DepartmentsPersons)

The result is a flat relation of type:

     {<Dept: [Name : string, Phone : int] | Pers: [Name : string]>}

Projections and selections can be combined and applied to the inner relations. This is done
with a combined operator, called Combo, that does both the selection and the projection. A
combo operator takes an argument p, called a p-former, that describes what selections and
projections are to be done. The p-former generalizes both the argument in a selection, σ_p, and
that in a projection, Π_p. The following is an example of a p-former:

     p = <Dept/(Floor>3):[Name,Floor] |
          Pers/(Name like "S%"):
               [Name, Assignments: {<Project/(Lang="C++"):[Name] |
                                     Committee/(Name="Promotions"):[Name]>}]>

Then the combo operator is written as:

     Σ_p(DepartmentsPersons)

This combo operator applies the selection condition (Floor > 3) and projects on the
Name and Floor fields for the Dept rows. The combo operator also selects on the
condition (Name like "S%") and then projects on the Name and Assignments fields of the Pers
rows. Furthermore, this combo operator processes Assignments recursively as follows.
It applies the selection condition (Lang = "C++") and projects on the
Name field on Project rows, and selects on (Name = "Promotions") and projects on the Name
field on Committee rows. The type of the result of this combo operator is:

     {<Dept: [Name : string, Floor : int] |
       Pers: [Name : string, Assignments: {<Project: [Name : string] |
                                            Committee: [Name : string]>}]>}

The result of this combo operator consists of the following fields of the rows marked with * below:
Name and Floor for the marked Dept rows, Name and Assignments for the marked Pers rows, and
Name for the marked nested Project and Committee rows.

DepartmentsPersons

     Dept:  [Name: Payroll,          ID: P,  Phone: 2345, Floor: 3]
     Dept:  [Name: Payroll-Temps,    ID: P2, Phone: 2244, Floor: 1]
  *  Pers:  [SSN: 123456789, Name: Smith, Assignments:
            *  Project:    [Name: Compiler,  Lang: C++]
            *  Project:    [Name: Optimizer, Lang: C++]
               Committee:  [Name: Awards]
               Project:    [Name: Wrapper,   Lang: Java]]
     Pers:  [SSN: 234567890, Name: John, Assignments:
               Project:    [Name: Compiler,  Lang: C++]
               Project:    [Name: Wrapper,   Lang: C++]
               Committee:  [Name: Awards]
               Committee:  [Name: Promotion]
               Committee:  [Name: Disciplinary]]
  *  Dept:  [Name: Payroll-NJ,       ID: P3, Phone: 2345, Floor: 4]
  *  Dept:  [Name: Engineering,      ID: E,  Phone: 7654, Floor: 6]
  *  Pers:  [SSN: 111111111, Name: Sue, Assignments:
               Project:    [Name: Compiler,  Lang: Java]
            *  Project:    [Name: Optimizer, Lang: C++]
            *  Committee:  [Name: Promotions]
               Project:    [Name: Wrapper,   Lang: Java]]
     Dept:  [Name: ReEngineering,    ID: E2, Phone: 2244, Floor: 3]
  *  Dept:  [Name: Marketing-US,     ID: MU, Phone: 1818, Floor: 4]
     Dept:  [Name: Marketing-Europe, ID: ME, Phone: 9876, Floor: 2]

The Combo operator selects a submatrix via a combination of selections (on tags and
predicates) and projections (on fields).
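
The recursive behavior of the example p-former can be sketched as follows. The encoding is an assumption made for illustration: a p-former is a dictionary from tag to a (condition, fields) pair, where each field maps either to `None` (atomic, copied as-is) or to a nested p-former for a set-valued field. The `like "S%"` test is approximated with `str.startswith`, and only a few rows of the table are reproduced.

```python
def combo(table, p):
    """Apply p-former p to a list of (tag, record) rows; unmentioned tags are dropped."""
    out = []
    for tag, rec in table:
        if tag not in p:
            continue
        cond, fields = p[tag]
        if not cond(rec):
            continue
        new = {}
        for name, inner in fields.items():
            # inner is None for an atomic field, or a nested p-former for a set-valued field
            new[name] = rec[name] if inner is None else combo(rec[name], inner)
        out.append((tag, new))
    return out


ROWS = [
    ("Dept", {"Name": "Payroll", "ID": "P", "Phone": 2345, "Floor": 3}),
    ("Dept", {"Name": "Engineering", "ID": "E", "Phone": 7654, "Floor": 6}),
    ("Pers", {"SSN": 111111111, "Name": "Sue", "Assignments": [
        ("Project", {"Name": "Compiler", "Lang": "Java"}),
        ("Project", {"Name": "Optimizer", "Lang": "C++"}),
        ("Committee", {"Name": "Promotions"}),
        ("Project", {"Name": "Wrapper", "Lang": "Java"}),
    ]}),
    ("Pers", {"SSN": 234567890, "Name": "John", "Assignments": [
        ("Project", {"Name": "Compiler", "Lang": "C++"}),
    ]}),
]

P = {
    "Dept": (lambda r: r["Floor"] > 3, {"Name": None, "Floor": None}),
    "Pers": (lambda r: r["Name"].startswith("S"), {
        "Name": None,
        "Assignments": {
            "Project": (lambda r: r["Lang"] == "C++", {"Name": None}),
            "Committee": (lambda r: r["Name"] == "Promotions", {"Name": None}),
        },
    }),
}

result = combo(ROWS, P)
```

The same function handles the outer table and each nested Assignments table, which is what lets a single p-former describe selections and projections at every nesting level.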



2 Operands and Typing 

An NCR is strongly typed, which allows type inference and type checking. The basic
manipulable types are tables and single-valued attributes. The basic types are primitives or
sets, which are defined as:

Base or primitive types:

     b ::= integer | long integer | string | float | decimal[precision*accuracy] | ...

Manipulable types:

     t ::= b        Member of base type
     t ::= {u}      Set (table) of tuples from "union type"

Record (i.e., row), variant, and union types are defined as:

Record type:

     r ::= [a_1:t_1, a_2:t_2, ..., a_n:t_n]

where a_1, a_2, ..., a_n are distinct labels called attributes, or fields, or just labels.

Variant type:

     v ::= tag:r

where tag is a label, called the tag of the variant type.

Union type:

     u ::= <v_1 | v_2 | ... | v_n>

where v_1, v_2, ..., v_n are variant types having distinct tags.

In the following example, a table is defined that includes both People(name:String, ssn:int)
rows and Employees(name:String, ssn:int, salary:float) rows. The record types for People
and Employees are defined as follows:

     r_People    ::= [name:String, ssn:int]
     r_Employees ::= [name:String, ssn:int, salary:float]

The variants are defined as:

     v_People    ::= People:[name:String, ssn:int]
     v_Employees ::= Employees:[name:String, ssn:int, salary:float]

and the union type is:

     u_PeopleEmployees ::= <People:[name:String, ssn:int] |
                            Employees:[name:String, ssn:int, salary:float]>

The NCR type describing a table consisting of both people and employees is:

     t_PeopleEmployees ::= {<People:[name:String, ssn:int] |
                             Employees:[name:String, ssn:int, salary:float]>}
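
As a rough illustration of how strong typing enables type checking, the grammar above can be encoded with ordinary Python values. Everything in this sketch is an assumption made for illustration: base types are Python types, a record type is a dictionary from attribute to type, a union type is a dictionary from tag to record type, and a set type is a one-element list holding a union type. The sample rows are invented.

```python
# t_PeopleEmployees under the illustrative encoding described above.
T_PEOPLE_EMPLOYEES = [{
    "People": {"name": str, "ssn": int},
    "Employees": {"name": str, "ssn": int, "salary": float},
}]


def conforms(value, t):
    """Return True if value has NCR type t under the encoding above."""
    if isinstance(t, list):  # set (table) of tagged rows
        union = t[0]
        return isinstance(value, list) and all(
            tag in union and conforms_record(rec, union[tag]) for tag, rec in value
        )
    return isinstance(value, t)  # base type


def conforms_record(rec, r):
    """A record conforms if it has exactly the attributes of r, each of the right type."""
    return rec.keys() == r.keys() and all(conforms(rec[a], r[a]) for a in r)


# Hypothetical sample data, not taken from the source.
good = [
    ("People", {"name": "Ann", "ssn": 1}),
    ("Employees", {"name": "Bob", "ssn": 2, "salary": 83000.0}),
]
bad = [("People", {"name": "Ann", "ssn": 1, "salary": 1.0})]
```

Since a set type may appear as the type of a record attribute, the same `conforms` function also covers nested tables.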



Operators 

A formal algebra using these types is defined using the following general principles.

• Each operator should support tables with heterogeneous rows (i.e., a table of
union types).

• A different action can be specified for an operator for each unique row type.

• Operators focus on the common case and a generalized map operator handles
complex cases.



2.1 Project 

The Project operator returns a different subset of the attributes from each variant type. The
project operator, denoted Π_p, is parameterized by a list, p, of elements of the form tag:[set of
attributes]. "p" is called a projection p-former. It determines both the type of the project
operator and its semantics (which columns are being projected out).

Defn. Projection p-former:

Given a union type u = <tag_1: t_1 | ... | tag_m: t_m>, where
t_1 = [a_11 : t_11, a_12 : t_12, ..., a_1n_1 : t_1n_1], ...,
t_m = [a_m1 : t_m1, a_m2 : t_m2, ..., a_mn_m : t_mn_m], a projection p-former
with input u is an expression:

     p = <tag_i1:[a_i1j1, a_i1j2, ...] | tag_i2:[a_i2k1, a_i2k2, ...] | ... | tag_iq:[a_iql1, a_iql2, ...]>

where the indexes satisfy:

     1 <= i_1 < i_2 < ... < i_q <= m,

and each bracketed list names attributes of the record type with the corresponding tag.

The "type" of the projection p-former is:

     p :: u -> u'

where u' is defined as follows. Let t_j', j=1,...,m be:

     t_j' = the record type consisting of the attributes listed for tag_j in p, when j is one of i_1, ..., i_q
     t_j' = t_j, when j is not one of i_1, ..., i_q

The output type is u' = <tag_1: t_1' | ... | tag_m: t_m'>.

Defn. Project Operator:

Let p, u, u' be as above, and let t = {u}, t' = {u'}. The projection operator is:

     Π_p(t) = t'

The project operator only affects tuples with tags that are mentioned in p. Tuples with
other tags are simply copied to the output, unaffected. The following table shows the
input to a project operator. The input table is an EmpsDeptsSites NCR with
Emp, Dept, and Site tags.



     Emp:   [name: Bob,     ssn: 123-45-6789, sal: 83000, phone: 123-4567]
     Dept:  [name: Payroll,      loc: HQ, mgr: Bob]
     Emp:   [name: Marilyn, ssn: 321-54-9876, sal: 78000, phone: 487-0128]
     Site:  [name: HQ, city: San Jose]
     Emp:   [name: Qing,    ssn: 673-82-3845, sal: 39000, phone: 674-3834]
     Dept:  [name: Quality Ctrl, loc: HQ, mgr: Marilyn]
     Emp:   [name: Betsy,   ssn: 233-23-6352, sal: 75000, phone: 234-3473]
     Emp:   [name: Brian,   ssn: 341-69-0323, sal: 33000, phone: 236-5325]
     Dept:  [name: Sales,        loc: HQ, mgr: Betsy]
     Emp:   [name: Sam,     ssn: 356-02-6743, sal: 43000, phone: 672-7832]



The following projection operator is applied to the EmpsDeptsSites table:

     Π_{<Emp:[name] | Dept:[name,mgr] | Site:[name,city]>}(EmpsDeptsSites)

The result is the following table:

     Emp:   [name: Bob]
     Dept:  [name: Payroll,      mgr: Bob]
     Emp:   [name: Marilyn]
     Site:  [name: HQ, city: San Jose]
     Emp:   [name: Qing]
     Dept:  [name: Quality Ctrl, mgr: Marilyn]
     Emp:   [name: Betsy]
     Emp:   [name: Brian]
     Dept:  [name: Sales,        mgr: Betsy]
     Emp:   [name: Sam]

The project operator is equivalent to Π_{<Emp:[name] | Dept:[name,mgr]>}(EmpsDeptsSites). The Site
records are included in the result by default.
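
The copy-by-default behavior of Project can be sketched as below. The encoding (rows as (tag, record) pairs, a projection p-former as a dictionary from tag to attribute list) and the function name `project` are assumptions for illustration; only three rows of EmpsDeptsSites are reproduced.

```python
def project(table, p):
    """Project each row onto the attributes p lists for its tag; copy other rows as-is."""
    return [
        (tag, {a: rec[a] for a in p[tag]} if tag in p else rec)
        for tag, rec in table
    ]


EMPS_DEPTS_SITES = [
    ("Emp", {"name": "Bob", "ssn": "123-45-6789", "sal": 83000, "phone": "123-4567"}),
    ("Dept", {"name": "Payroll", "loc": "HQ", "mgr": "Bob"}),
    ("Site", {"name": "HQ", "city": "San Jose"}),
]

# Mentioning Site with all of its attributes, or omitting it, gives the same result.
full = project(EMPS_DEPTS_SITES, {"Emp": ["name"], "Dept": ["name", "mgr"], "Site": ["name", "city"]})
short = project(EMPS_DEPTS_SITES, {"Emp": ["name"], "Dept": ["name", "mgr"]})
```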



2.2 Select

The Select operator returns a different subset of records for each variant type. The select
operator, denoted σ_p, is parameterized by a list, p, of elements of the form tag/condition. "p"
is called a selection p-former. It determines both the type of the select operator and its
semantics. In the following, conditions and expressions are defined.

Defn. Expressions:

Given k record types r_1, ..., r_k, and a type t, an expression e of type t with
arguments r_1, ..., r_k, in notation:

     e : r_1 x ... x r_k -> t

is defined inductively below. Expressions occur in a context, which is defined to
be a sequence of record types, (r_1', ..., r_n'); unless specified, the context is
empty, n=0.

1. Attribute: if r_i = [..., a : t, ...], then $i.a is an expression of type t.

2. Context: if r_j' = [..., a : t, ...], then $[j].a is an expression of type t.

3. Scalar operator: if e_1, e_2 are two expressions of base types b_1, b_2
   respectively, then e_1 op e_2 is an expression of type b, where op is an
   operator:

     op : b_1 x b_2 -> b

   op is one of (+, *, /), or a string operator (concat, substr), or any user-defined
   function on scalar values.

4. NCR expression: if e_1, e_2, ... are expressions of types t_1, t_2, ..., then
   f(e_1, e_2, ...) is an expression of type t, where f is an NCR expression
   f : t_1 x t_2 x ... -> t.

When k=1, then only $1 can occur (i.e., not $2, $3, ...) and $1.a is abbreviated
with a.

Contexts are not used in selections, but are used later in the combo operator.

Defn. Conditions (predicates) on records:

Given k record types r_1, ..., r_k, a condition with arguments r_1, ..., r_k, in notation:

     c : r_1 x ... x r_k -> bool

is defined inductively as follows:

1. if e_1, e_2 are expressions of base types b_1, b_2 respectively, both with
   arguments r_1, ..., r_k, then e_1 oprel e_2 is a condition with arguments r_1, ..., r_k,
   where oprel is <, <=, >, >= or string operations such as substr, prefix,
   suffix, like, etc.

2. if e is an expression of type {<...| tag:[a_1:t_1,...,a_n:t_n] |...>} and e_1, ..., e_n
   are expressions of types t_1, ..., t_n respectively, then:

     <tag:[a_1:e_1, ..., a_n:e_n]> IN e

   is a condition.

3. if e is an expression of type {...}, then:

     exists(e)

   is a condition.

4. true, false are conditions.

5. if c_1, c_2 are conditions, then so are c_1 and c_2, c_1 or c_2, not c_1.

Defn. Selection p-former:

A selection p-former is an expression:

     p = <tag_i1/c_1 | ... | tag_ik/c_k>

and its "type" is:

     p :: <tag_1:r_1 | ... | tag_n:r_n> -> <tag_i1:r_i1 | ... | tag_ik:r_ik>

where i_1, ..., i_k ∈ {1, ..., n}, and c_j is a condition, c_j : r_ij -> bool, for j=1,...,k.

Defn. Selection operator:

Let p be a selection p-former. The selection operator is:

     σ_p({t}) = {t}

The selection condition(s) only apply to tuples with tags mentioned in p. Tuples with
other tags are simply copied to the output, unaffected.

When the following select operator is applied to the EmpsDeptsSites table described above

     σ_{<Emp/(sal>50000) | Dept/(mgr="Betsy") | Site/(true)>}(EmpsDeptsSites)

the result is:

     Emp:   [name: Bob,     ssn: 123-45-6789, sal: 83000, phone: 123-4567]
     Emp:   [name: Marilyn, ssn: 321-54-9876, sal: 78000, phone: 487-0128]
     Site:  [name: HQ, city: San Jose]
     Emp:   [name: Betsy,   ssn: 233-23-6352, sal: 75000, phone: 234-3473]
     Dept:  [name: Sales, loc: HQ, mgr: Betsy]

The same operator can be expressed as:

     σ_{<Emp/(sal>50000) | Dept/(mgr="Betsy")>}(EmpsDeptsSites)

The select operator operates only on the top level, in that it decides for each top-level record
whether to keep it or toss it. However, the condition c can look deep inside the current record
(e.g., by using existential/universal quantifiers).
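
The point that a condition may look inside a record while the operator still acts only on top-level rows can be sketched as follows. The encoding and the helper names `select` and `has_java_project` are illustrative assumptions; the two rows are taken from the nested DepartmentsPersons example, with the SSN field omitted for brevity.

```python
def select(table, p):
    """Top-level selection: keep a row if its tag is unmentioned or its condition holds."""
    return [(tag, rec) for tag, rec in table if tag not in p or p[tag](rec)]


PERSONS = [
    ("Pers", {"Name": "Smith", "Assignments": [
        ("Project", {"Name": "Compiler", "Lang": "C++"}),
        ("Project", {"Name": "Wrapper", "Lang": "Java"}),
    ]}),
    ("Pers", {"Name": "John", "Assignments": [
        ("Committee", {"Name": "Awards"}),
    ]}),
]


def has_java_project(rec):
    """Existential condition: some nested Assignments row is a Java project."""
    return any(tag == "Project" and a["Lang"] == "Java" for tag, a in rec["Assignments"])


kept = select(PERSONS, {"Pers": has_java_project})
```

The kept row retains its full Assignments set, including the C++ project: the quantifier only decides whether the top-level record survives.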



2.3 Rename 

The rename operator (ρ) renames tags and/or attributes. It is a generalization of the rename
operator in the relational algebra.

Defn. Renaming p-former:

A renaming p-former is an expression:

     p = <tag_1->tag_1':[a_11->a_11', a_12->a_12', ...] | tag_2->tag_2':[a_21->a_21', a_22->a_22', ...] | ... >

Defn. Renaming operator:

Given a renaming p-former p, the renaming operator is:

     ρ_p({t}) = {t'}.

This operator renames tag_1 to tag_1', and renames the fields in the record of type tag_1 by
changing a_11 to a_11', ...; renames tag_2 to tag_2', and so on. All tags and/or labels that are
not mentioned are left unchanged. The output type is "isomorphic" to the input type:
only the variant and record labels at the top level have changed. The "identity" mapping
a_i -> a_i can be abbreviated to a_i in specifying a renaming.
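
A minimal sketch of renaming, assuming a renaming p-former is encoded as a dictionary from tag to a (new tag, attribute mapping) pair. The encoding and the function name `rename` are assumptions for illustration; the new names used in the usage lines are illustrative.

```python
def rename(table, p):
    """p maps tag -> (new_tag, {old_attr: new_attr}); unmentioned tags and attributes keep their names."""
    out = []
    for tag, rec in table:
        new_tag, attrs = p.get(tag, (tag, {}))
        out.append((new_tag, {attrs.get(a, a): v for a, v in rec.items()}))
    return out


rows = [
    ("Dept", {"name": "Payroll", "loc": "HQ", "mgr": "Bob"}),
    ("Site", {"name": "HQ", "city": "San Jose"}),
]
renamed = rename(rows, {"Dept": ("Depart", {"name": "teammember", "loc": "place"})})
```

Only labels change: every value and the overall shape of each row are preserved, which is the sense in which the output type is isomorphic to the input type.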

2.4 Extend 

The extend operator (ε) adds new fields to a record, each computed by some expression from
other fields.

Defn. Extension p-former:

Given union types u = <tag_1:r_1, ..., tag_n:r_n> and u' = <tag_1:r_1', ..., tag_n:r_n'>, an extension
p-former of type u -> u' is an expression:

     p = <tag_1:[c_11:e_11, c_12:e_12, ...] | ... | tag_n:[c_n1:e_n1, c_n2:e_n2, ...]>

where the expressions e_ij have types e_ij : r_i -> t_ij', and r_i' is a record type obtained by
adding the fields [c_i1:t_i1', c_i2:t_i2', ...] to r_i.

Defn. Extension operator:

Given an extension p-former p, the extension operator is:

     ε_p({t}) = {t'}.

The meaning is that new labels c_11, c_12, ... are added to the corresponding records, and
their values are computed by the expressions e_11, e_12, .... The values of all the other labels
are copied into the output unchanged. As for the other operators, not all tags need to be
mentioned: the missing ones are copied to the output.
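
Extension can be sketched in the same style, assuming an extension p-former is encoded as a dictionary from tag to a mapping of new attribute names to expressions over the record. The function name `extend` and the computed attribute `sal_k` are hypothetical and introduced only for illustration.

```python
def extend(table, p):
    """p maps tag -> {new_attr: expression(record)}; existing fields are copied unchanged."""
    return [
        (tag, {**rec, **{c: e(rec) for c, e in p.get(tag, {}).items()}})
        for tag, rec in table
    ]


rows = [
    ("Emp", {"name": "Bob", "sal": 83000}),
    ("Site", {"name": "HQ", "city": "San Jose"}),
]
# Add a hypothetical field holding the salary in whole thousands.
extended = extend(rows, {"Emp": {"sal_k": lambda r: r["sal"] // 1000}})
```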


2.5 Combo 

The Combo operator (Σ) combines Project, Select, Rename, and Extend, and does this to
arbitrary nesting levels. The Combo operator is parameterized by an argument that is a deeply
structured expression combining arguments of Project (Π), Select (σ), Rename (ρ), and
Extend (ε). Such an expression is called a p-former. Unlike the other operators, combo does
not implicitly copy "the other" tags and labels to the output, but deletes them. This allows
both copying and deleting. (For convenience, another version of the combo operator may also
be defined that copies by default.)

The rules below define a p-former inductively. A p-former, p, has an input type t and an
output type t', and the p-former is denoted by:

     p :: t -> t'

The p-former takes an input value of type t and returns either an output value of type t' or
"nothing." When used in a Combo operator, Σ_p, the type of the Combo operation is {t} ->
{t'}. Consider a selection in the relational algebra, σ_{age<20}. Here (age<20) is a p-former: it takes a
record and returns either the same record (if age<20) or nothing (otherwise).

For each p-former, there is a context (which, recall, is a sequence of record types, (r_1, ..., r_n)).
Inner p-formers have a context consisting of all record types of the surrounding records. The
top-level p-former has an empty context.

1. The "identity p-former":

     _ :: t -> t   for every type t

   returns its input, unchanged.

2. The "record p-former":

   if p_1 :: t_i1 -> t_1', ..., p_k :: t_ik -> t_k' are p-formers, {i_1, ..., i_k} ⊆ {1, 2, ..., n}, and
   e_1 : [a_1:t_1,...,a_n:t_n] -> t_1'', ..., e_p : [a_1:t_1,...,a_n:t_n] -> t_p'' are expressions, then

     [(a_i1 -> b_1):p_1, ..., (a_ik -> b_k):p_k, c_1:e_1, ..., c_p:e_p] :: r -> r'

   where r = [a_1:t_1,...,a_n:t_n] and r' = [b_1:t_1', ..., b_k:t_k', c_1:t_1'', ..., c_p:t_p''].

   The (a -> b):p components rename an attribute a to b, and apply recursively the
   p-former p on the value of that attribute. The c:e components introduce a new
   attribute (c) whose value is computed by the expression e over r. If the context of
   p is (r_1, ..., r_n), then the context for each of p_1, ..., p_k is (r_1, ..., r_n, r).
   The following condition applies to p-formers in Combo operators:

     (*) a_i1, ..., a_ik are distinct and e_1, ..., e_p are scalar expressions

   (The b's and c's are also distinct, which follows implicitly from their use in the
   type [b_1:t_1', ..., b_k:t_k', c_1:t_1'', ..., c_p:t_p''].) The restriction ensures that every complex
   value is copied to the output at most once, and new values being produced are
   scalar: this enables simple, pipeline computation.

   Given an input value [a_1:v_1, ..., a_n:v_n], the p-formers p_1, ..., p_k first apply to the
   values v_i1, ..., v_ik respectively. If any of them returns "nothing," then the record
   p-former returns "nothing." Otherwise, let v_1', ..., v_k' be the values returned by
   the p-formers, and let w_1, ..., w_p be the values returned by the expressions e_1, ...,
   e_p: the record p-former returns the record [b_1:v_1', ..., b_k:v_k', c_1:w_1, ..., c_p:w_p].

3. The "variant p-former" is parameterized by a selection condition, c:

   if c : r -> bool and p :: r -> r'
   then (tag/c -> tag'):p :: tag:r -> tag':r'

   The condition c is checked first. If it returns false, then the p-former returns
   "nothing." Otherwise, it returns whatever p returns, but changes the tag to tag'.
   The c may use existential/universal quantifiers on the set fields of r.

4. The "union" p-former is:

   if p_1 :: v_i1 -> v_1', ..., p_k :: v_ik -> v_k' and {i_1, ..., i_k} ⊆ {1, 2, ..., n}
   then <p_1 | ... | p_k> :: <v_1 | ... | v_n> -> <v_1' | ... | v_k'>

   Given a value of type <v_1 | ... | v_n>, the union p-former checks that it is of one of the
   variant types v_i1, ..., v_ik: if not, it returns "nothing," otherwise it returns whatever
   the corresponding p-former returns. Neither the variant types v_i1, ..., v_ik nor the
   variant types v_1', ..., v_k' need be disjoint; however, identical tags in the latter
   correspond to identical types, i.e., <v_1' | ... | v_k'> is a correctly formed type.

5. The "set" p-former is:

   if p :: u -> u'
   then {p} :: {u} -> {u'}

   Given a set {x_1, ..., x_n}, the set p-former first applies the p-former p to each
   element in the set. Let y_1, ..., y_k be all values returned by p (i.e., excluding
   "nothing"); then the set p-former returns {y_1, ..., y_k}.

Defn. Combo operator: if p is a p-former, p :: t -> t', then:

     Σ_p({t}) = {t'}

Given a set {x_1, ..., x_n}, the operator first applies the p-former p to each element
in the set. Let y_1, ..., y_k be all values returned by p (i.e., excluding "nothing").
Then Σ_p returns {y_1, ..., y_k}.
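
The inductive rules can be sketched as small Python combinators, one per kind of p-former. This is an illustrative reading, not the described implementation: the function names, the `NOTHING` sentinel, and the (tag, record) row encoding are assumptions, and contexts are omitted. The union combinator here lets every matching variant contribute a row, which is an assumption about how overlapping conditions on the same tag are meant to behave.

```python
NOTHING = object()  # sentinel standing for the 'nothing' result of a p-former


def identity(value):
    """Identity p-former: returns its input unchanged."""
    return value


def record(fields, new=()):
    """Record p-former. fields: (a, b, p) triples that rename a to b and apply p;
    new: (c, e) pairs that add attribute c computed by expression e."""
    def apply(rec):
        out = {}
        for a, b, p in fields:
            v = p(rec[a])
            if v is NOTHING:  # any inner 'nothing' makes the whole record 'nothing'
                return NOTHING
            out[b] = v
        for c, e in new:
            out[c] = e(rec)
        return out
    return apply


def variant(tag, cond, new_tag, p):
    """Variant p-former (tag/cond -> new_tag):p."""
    def apply(row):
        t, rec = row
        if t != tag or not cond(rec):
            return NOTHING
        v = p(rec)
        return NOTHING if v is NOTHING else (new_tag, v)
    return apply


def union(*variants):
    """Union p-former: collect the result of every variant that accepts the row."""
    def apply(row):
        return [v for v in (p(row) for p in variants) if v is not NOTHING]
    return apply


def combo(p, table):
    """Combo operator: apply union p-former p to each row, dropping 'nothing'."""
    return [y for row in table for y in p(row)]


def set_of(p):
    """Set p-former {p}, for use on a nested, set-valued attribute."""
    return lambda rows: combo(p, rows)


name_only = record([("name", "name", identity)])
p = union(
    variant("Emp", lambda r: r["sal"] > 50000, "HighEarner", name_only),
    variant("Dept", lambda r: r["mgr"] == "Betsy", "BetsyManages", name_only),
    variant("Site", lambda r: True, "Site",
            record([("name", "name", identity), ("city", "city", identity)])),
)
rows = [
    ("Emp", {"name": "Bob", "sal": 83000}),
    ("Emp", {"name": "Qing", "sal": 39000}),
    ("Dept", {"name": "Payroll", "mgr": "Bob"}),
    ("Dept", {"name": "Sales", "mgr": "Betsy"}),
    ("Site", {"name": "HQ", "city": "San Jose"}),
]
result = combo(p, rows)
```

Using a dedicated sentinel rather than `None` keeps a legitimately empty or null field value distinct from a p-former that rejected its input.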


The following p-former is defined for the EmpsDeptsSites table:

     p = <Emp/(sal>50000) -> HighEarner:[name->name:_] |
          Dept/(mgr="Betsy") -> BetsyManages:[name->name:_] |
          Site/(true) -> Site:[name->name:_, city->city:_]>

The following combo operation is applied to the table:

     Σ_p(EmpsDeptsSites)

If (a -> a) is abbreviated with a, p:_ with p, and tag/(true) with tag, then the combo operator can
be abbreviated as:

     Σ_{<Emp/(sal>50000)->HighEarner:[name] | Dept/(mgr="Betsy")->BetsyManages:[name] | Site:[name,city]>}(EmpsDeptsSites)

The result of this combo operator is:

     HighEarner:    [name: Bob]
     HighEarner:    [name: Marilyn]
     Site:          [name: HQ, city: San Jose]
     HighEarner:    [name: Betsy]
     BetsyManages:  [name: Sales]




This example illustrates the use of contexts. The following relation (base types omitted) is
used.

     Pers: {<a:[name, birthday, projects:{<b:[title, deadline, modules:{<c:[id, date]>}]>}]>}

Each person has a set of projects, and each project has a set of modules.

a. The combo Σ_p2(Pers) returns all persons that work on a project whose deadline is on
their birthday. Its definition needs two p-formers:

     p1 = <b/($1.deadline = $[1].birthday):_>
     p2 = <a/(Exists(Σ_p1($1.projects))):_>

The context expression $[1].birthday in p1, which retrieves the birthday from one level
higher up, is used in the "context" of p2, which defines the outer context to be a record
of type [name, birthday, projects].

b. The combo Σ_p3(Pers) deletes from every person all projects whose deadline is on the
person's birthday:

     p3 = <a:[name:_, birthday:_, projects:{<b/($1.deadline != $[1].birthday):_>}]>

c. This illustrates the use of a $[2] context. The combo Σ_p4(Pers) operates as before, but in
addition it deletes all modules whose dates are on the person's birthday:

     p4 = <a:[name:_,
              birthday:_,
              projects:{<b/($1.deadline != $[1].birthday):
                           [title:_,
                            deadline:_,
                            modules:{<c/($1.date != $[2].birthday):_>}
                           ]>}
             ]>
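
The way $[1] and $[2] reach outward can be sketched by threading an explicit list of enclosing records through the traversal. This hand-written version of p4 is an illustration under assumptions: the context is modeled as a Python list with the innermost enclosing record last, `up(context, j)` stands for $[j], and the sample person, dates, and module ids are invented.

```python
def up(context, j):
    """$[j]: the record j levels above the current one (j=1 is the immediately enclosing record)."""
    return context[-j]


def p4(persons):
    """Drop projects due on the person's birthday, and modules dated on that birthday."""
    out = []
    for tag, person in persons:
        ctx_a = [person]  # context seen by the project-level p-former
        projects = []
        for btag, proj in person["projects"]:
            if proj["deadline"] == up(ctx_a, 1)["birthday"]:
                continue  # fails b/($1.deadline != $[1].birthday)
            ctx_b = ctx_a + [proj]  # context seen by the module-level p-former
            modules = [
                (ctag, m) for ctag, m in proj["modules"]
                if m["date"] != up(ctx_b, 2)["birthday"]  # c/($1.date != $[2].birthday)
            ]
            projects.append((btag, {"title": proj["title"], "deadline": proj["deadline"], "modules": modules}))
        out.append((tag, {"name": person["name"], "birthday": person["birthday"], "projects": projects}))
    return out


# Hypothetical sample data, not taken from the source.
PERS = [("a", {"name": "Sue", "birthday": "05-01", "projects": [
    ("b", {"title": "Compiler", "deadline": "05-01", "modules": []}),
    ("b", {"title": "Wrapper", "deadline": "06-30", "modules": [
        ("c", {"id": 1, "date": "05-01"}),
        ("c", {"id": 2, "date": "07-04"}),
    ]}),
]})]

result = p4(PERS)
```

At the module level the person record is two steps out, behind the enclosing project record, which is why p4 uses $[2] there and $[1] at the project level.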






1. A Projection operator is a particular case of the Combo operator. For example, 

Π<Emp:[name] | Dept:[name,mgr] | Site:[name,city]> (EmpsDeptsSites) is the same as 
Σ<Emp:[name] | Dept:[name,mgr] | Site:[name,city]> (EmpsDeptsSites). 

2. A Selection operator is a particular case of the Combo operator. For example, 

σ<Emp/(sal>50000) | Dept/(mgr="Betsy") | Site/(true)> (EmpsDeptsSites) is the same as 
Σ<Emp/(sal>50000) | Dept/(mgr="Betsy") | Site/(true)> (EmpsDeptsSites). 

3. A Renaming operator is a particular case of the Combo operator. For example, 
given the renaming p-former: 

p = <Emp->Employee:[name->person, ssn, sal, phone->contact] | 
Dept->Depart:[name->teammember, loc->place, mgr] | 
Site->Site:[name, city]> 
the renaming operator 

ρp(EmpsDeptsSites) 
is the same as 

Σp(EmpsDeptsSites). 

4. The most general form of a Combo operator is like a submatrix selection. There 
is no copying and no unnesting involved. 
5. Combo can be used to homogenize a collection. For example, in 
EmpsDeptsSites there are three different kinds of records, and all share a name 
attribute. The following Combo operator extracts all names and constructs a 
homogeneous collection: 

Σ<Emp->Res:[name] | Dept->Res:[name] | Site->Res:[name]>(EmpsDeptsSites). 

The result is of type 

{<Res:[name:string]>} 
which is a homogeneous collection. 
6. Combo can be used to dispatch records to different types (e.g., transforming a 
homogeneous collection into a heterogeneous one). The following Combo 
operator splits Emps into Regular and HighPaid: 

Σ<Emp/(sal<100k)->Regular:[name,phone] | Emp/(sal>100k)->HighPaid:[name,phone]>(Emps) 

The input has type 
{<Employee:[name, phone, salary]>} 

(base types omitted) while the output has type 

{<Regular:[name] | HighPaid:[name, phone]>} 

In general, conditions that are applied to a tag may overlap. For example, 
employees with sal<100k may be dispatched to some type, and those with 
sal>50k to another type. In this case, records that satisfy both conditions will 
contribute to both outputs. 
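
The dispatching behaviour with overlapping conditions can be sketched in Python. This is a simplified, hypothetical model rather than the engine's implementation: an NCR is a list of (tag, record) pairs, and a p-former is a list of (input tag, condition, output tag, kept labels) entries. The function name combo, the tag WellPaid and the sample salaries are assumptions made for the illustration.

```python
def combo(records, former):
    """Apply each matching branch of the p-former to each record.
    Records whose tag matches no branch are dropped from the output."""
    out = []
    for tag, rec in records:
        for in_tag, cond, out_tag, labels in former:
            if tag == in_tag and cond(rec):
                out.append((out_tag, {label: rec[label] for label in labels}))
    return out


emps = [
    ("Emp", {"name": "Bob", "phone": "555-0100", "sal": 40_000}),
    ("Emp", {"name": "Ann", "phone": "555-0101", "sal": 80_000}),
    ("Emp", {"name": "Eve", "phone": "555-0102", "sal": 150_000}),
]
# Two overlapping conditions on the same input tag.
former = [
    ("Emp", lambda r: r["sal"] < 100_000, "Regular", ["name", "phone"]),
    ("Emp", lambda r: r["sal"] > 50_000, "WellPaid", ["name", "phone"]),
]
dispatched = combo(emps, former)
```

In this sample, Ann satisfies both conditions and therefore appears once under each output tag.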



2.6 The Simple Combo 

The semantics of combo is that tags not mentioned in the p-former are dropped from the 
output. The "simple" combo has a complementary semantics: only tags/labels that are to 
be modified need be mentioned. By default, all others are copied to the output. 
Moreover, the "simple" combo performs only one single action, possibly at some depth in the 
NCR. 

Defn. Simple p-former: 

A simple p-former is a p-former that includes only a single selection (i.e., for a single 
tag), a single projection, renaming of a single tag or label, or an extension with a single 
new label. Formally, it is defined like a p-former with additional syntactic restrictions 
that ensure that only one action is performed (only a projection, a selection, a renaming, 
or an extension). The other parts of the simple p-former, which do the copying, are omitted. 
The simple p-former is illustrated by the following examples: 



• <Emp/(sal>=50000)> this simple p-former is a selection and is equivalent to 
<Emp/(sal>=50000):_ | Dept:_ | Other:_>, i.e., copy the other tags unchanged. 
Thus, to get the equivalent "real" p-former, the omitted tags Dept and Other need 
to be added. 

• <Dept:[name, projects]> this simple p-former is a projection and is equivalent to 
<Emp:_ | Dept:[name:_, projects:_] | Other:_>. 

• <Dept:[projects:{<urgent/(deadline>="10/10/2010")>}]> this is a selection on 
the inner relation projects, and is equivalent to 
<Emp:_ | Dept:[name:_, floor:_, projects:{<normal:_ | 
urgent/(deadline>="10/10/2010"):_>}] | Other:_>. 
Omitted tags (Emp, normal) are added and omitted labels (name, floor) are 
added too. 

• <Dept:[projects:{<urgent:[name, team]>}]> this is a projection at a deeper 
level. Tags and labels are added, except where projection is done: 
<Emp:_ | Dept:[name:_, floor:_, projects:{<normal:_ | 
urgent:[name:_, team:_]>}] | Other:_> 



Defn. Simple combo: 

Given a simple p-former p, the simple combo operator is: 

Σ⁰p({t}) = {t'}. 

The superscript 0 indicates that the combo is "simple" (i.e., all missing tags and labels in 
p have to be added). The semantics is defined as follows. Given a p-former p :: t -> t' 
(simple or not), define the completion of p to be c(p) :: t -> t'', obtained from p by 
"completing" the missing tags according to the type t (i.e., c(p) should actually be 
denoted c(p,t); it can be defined inductively; omitted). The semantics of the simple 
combo operator is: 

Σ⁰p(x) = Σc(p)(x) 

Example 

Given NCR EmpsDeptsSites from above, the following are simple combos: 
Σ⁰<Emp/(sal>50000):_> (EmpsDeptsSites) is the same as: 
Σ<Emp/(sal>50000):_ | Dept:_> (EmpsDeptsSites) 



Σ⁰<Dept:[mgr]> (EmpsDeptsSites) is the same as: 
Σ<Dept:[mgr:_] | Emp:_> (EmpsDeptsSites) 

The simple combos are strictly less powerful than combos, because they do not allow 
deletion of tags from the output type. For example, consider the combo 
Σ<Dept:[mgr]>(EmpsDeptsSites). Its output type is {<Dept:[mgr]>}. It may be expressed 
as a sequence of simple combos: 

Σ⁰<Emp/false:_> (Σ⁰<Dept:[mgr]> (EmpsDeptsSites)) 

where the second simple combo is needed to eliminate all Emp records. But the output 
type of this expression is {<Dept:[mgr] | Emp:[name, ssn, sal, phone]>}. A type 
checker could be modified to recognize that some conditions are (equivalent to) false and 
eliminate the corresponding tag from the output type. If so, combo operators could be 
expressed as a sequence of simple combo operators. 

2.7 Match 

The match operator, Ωm, generalizes the combo operator by relaxing some of its restrictions. 
The match operator is parameterized by an m-former, m, that is defined inductively like a 
p-former in Combo, with two generalizations: 

1. In the record m-former: 

[(a_{i_1} -> b_1):p_{i_1}, ..., (a_{i_k} -> b_k):p_{i_k}, c_1:e_1, ..., c_p:e_p] :: [a_1:t_1, ..., a_n:t_n] -> [b_1:t_1', ..., 
b_k:t_k', c_1:t_1'', ..., c_p:t_p''] 

the labels a_{i_1}, ..., a_{i_k} are not required to be distinct, and the expressions e_1, ..., e_p 
are not restricted to be scalars. (That is, the restriction (*) is dropped.) This allows 
data values to be copied. 

Example: Ω<Pers:[Name->Name1, Phone, Name->Name2]> copies the Name value, calling it 
Name1 and Name2. Such copying can be expensive when the value being 
copied is a large sub-relation. This is unlike Combo, where no copying is done. 
2. There exists an "unnest" m-former, with no corresponding p-former: 

if m :: u -> u' 
then unnest(m) :: {u} -> u' 

(This is different from the set p-former, {p} :: {u} -> {u'}.) Unnest flattens an 
inner relation. When two or more unnest m-formers are used in a record m-former, 
their result consists of a Cartesian product: again, this can be expensive. 

Whenever an unnest m-former is used inside another m-former m :: t -> t', the result type t' is 
not a legal type any more, but an "extended" type. t' can be converted back into a legal type 
using a mapping norm. 
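
The Cartesian-product effect of unnesting two inner relations of the same record can be sketched in Python. The sketch assumes a simplified model in which a record is a dict and each inner relation is a list of untagged dicts; the function name unnest_fields and the sample data are hypothetical. Field names are built by concatenation, in the spirit of the norm mapping.

```python
from itertools import product


def unnest_fields(record, fields):
    """Flatten the inner relations stored under `fields`.
    One output row is produced per combination of inner records."""
    inner_sets = [record[f] for f in fields]
    rows = []
    for combination in product(*inner_sets):
        # Copy the fields that are not unnested.
        flat = {k: v for k, v in record.items() if k not in fields}
        for field, child in zip(fields, combination):
            for label, value in child.items():
                flat[field + label] = value  # e.g. "A" + "a" gives "Aa"
        rows.append(flat)
    return rows


t = {
    "id": 7,
    "A": [{"a": 1}, {"a": 2}],
    "B": [{"e": "x"}, {"e": "y"}, {"e": "z"}],
}
rows = unnest_fields(t, ["A", "B"])
```

With two records under A and three under B, the single input record expands into six rows, which is why the text warns that multiple unnests can be expensive.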

Defn. Match operator: if m :: t -> t' is an m-former, where t is a type and t' an extended type, 
then: 
Ωm({t}) = {norm(t')} 
Example: 

Ω<T:[A:unnest(), B:unnest()]> ({<T:[A:{<T1:[a:t11, b:t12] | T2:[c:t13, d:t14]>}, 
B:{<T3:[e:t21, f:t22] | T4:[g:t23, h:t24]>}] 
>}) = 
{<T:[A:<T1:[a:t11, b:t12] | T2:[c:t13, d:t14]>, 
B:<T3:[e:t21, f:t22] | T4:[g:t23, h:t24]>] 
>} 

The only change in the output type is that some braces { ... } have been erased. The output 
type is technically illegal in the type system, because each record field needs to be either 
atomic or a set. It can be normalized by pulling out all variant types to the top level and 
flattening the records. The normalized type is defined to be the output type of Ω. We call 
norm the normalization operation. The following example illustrates the normalization 
operation: 

norm(<T:[A:<T1:[a:t11, b:t12] | T2:[c:t13, d:t14]>, B:<T3:[e:t21, f:t22] | T4:[g:t23, h:t24]>]>) 
= <TT1T3:[Aa:t11, Ab:t12, Be:t21, Bf:t22] | 
TT1T4:[Aa:t11, Ab:t12, Bg:t23, Bh:t24] | 
TT2T3:[Ac:t13, Ad:t14, Be:t21, Bf:t22] | 
TT2T4:[Ac:t13, Ad:t14, Bg:t23, Bh:t24]> 

The norm operation constructs new tag names and new field names by concatenating existing 
names. 

Definition of norm. The symbol "⊗" is used to denote the following operation between 
record and/or union types: 

[a11:t11, a12:t12, ..., a1m:t1m] ⊗ [a21:t21, a22:t22, ..., a2n:t2n] = 
[a11:t11, a12:t12, ..., a1m:t1m, a21:t21, a22:t22, ..., a2n:t2n] 

r ⊗ <tag1:r1 | tag2:r2 | ... | tagn:rn> = <tag1:r ⊗ r1 | tag2:r ⊗ r2 | ... | tagn:r ⊗ rn> 

<tag1:r1 | tag2:r2 | ... | tagn:rn> ⊗ r = <tag1:r1 ⊗ r | tag2:r2 ⊗ r | ... | tagn:rn ⊗ r> 

<tag11:r11 | tag12:r12 | ... | tag1m:r1m> ⊗ <tag21:r21 | tag22:r22 | ... | tag2n:r2n> = 
<tag1i tag2j : r1i ⊗ r2j | i = 1, ..., m, j = 1, ..., n> 

In the last line tag concatenation is used. The ⊗ operation concatenates two record types, 
or, if the two types are unions of m and n record types respectively, constructs a new 
union type with mn record types, by considering all pairs of concatenations. 

The symbol "⊕" is used to denote the union of two disjoint union types: 
<tag11:r11 | tag12:r12 | ... | tag1m:r1m> ⊕ <tag21:r21 | tag22:r22 | ... | tag2n:r2n> = 
<tag11:r11 | ... | tag1m:r1m | tag21:r21 | ... | tag2n:r2n> 

Two functions, rlabel_a and ulabel_tag, are defined that add or concatenate a record label a, or a 
tag, to a type: 

rlabel_a(t) = [a:t] where t is an atomic type or a set type 
rlabel_a([a1:t1, a2:t2, ..., an:tn]) = [aa1:t1, aa2:t2, ..., aan:tn] 
rlabel_a(<tag1:r1 | tag2:r2 | ... | tagn:rn>) = 
<tag1:rlabel_a(r1) | tag2:rlabel_a(r2) | ... | tagn:rlabel_a(rn)> 
ulabel_tag(r) = <tag:r> where r is a record type 
ulabel_tag(<tag1:r1 | tag2:r2 | ... | tagn:rn>) = <tagtag1:r1 | tagtag2:r2 | ... | tagtagn:rn> 

For the definition of the function norm, the definition of types ("normal types") is extended to 
"extended types." These are precisely the types t' that can occur in m-formers, m :: t -> t'. 

t ::= b 
t ::= {u} 
t ::= u /* this is an extended type and "illegal" under the normal definition */ 
r ::= [a1:t1, a2:t2, ..., an:tn] 
u ::= <tag1:r1 | tag2:r2 | ... | tagn:rn> 

The norm operation is defined by: 

norm(b) = b 
norm({u}) = {norm(u)} 
norm([a1:t1, a2:t2, ..., an:tn]) = rlabel_a1(norm(t1)) ⊗ ... ⊗ rlabel_an(norm(tn)) 
norm(<tag1:r1 | tag2:r2 | ... | tagn:rn>) = ulabel_tag1(norm(r1)) ⊕ ... ⊕ ulabel_tagn(norm(rn)) 

The operation norm(t) results in a normal type; and if t is a normal type then norm(t) = t. 
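
The definition of norm can be traced with a small Python sketch. The encoding of types is an assumption made only for this illustration: a base type is a string, a set type is ("set", u), a record type is ("rec", [(label, type), ...]) and a union type is ("union", [(tag, record), ...]). The function names otimes and oplus stand for the two operators introduced above, and rlabel and ulabel for the two labelling functions.

```python
from functools import reduce


def otimes(x, y):
    """Concatenate records, distributing over union types."""
    if x[0] == "rec" and y[0] == "rec":
        return ("rec", x[1] + y[1])
    if x[0] == "rec":  # record with union
        return ("union", [(tag, otimes(x, r)) for tag, r in y[1]])
    if y[0] == "rec":  # union with record
        return ("union", [(tag, otimes(r, y)) for tag, r in x[1]])
    # union with union: all pairs, tags concatenated
    return ("union", [(t1 + t2, otimes(r1, r2))
                      for t1, r1 in x[1] for t2, r2 in y[1]])


def oplus(x, y):
    """Union of two disjoint union types."""
    return ("union", x[1] + y[1])


def rlabel(a, t):
    if isinstance(t, str) or t[0] == "set":
        return ("rec", [(a, t)])
    if t[0] == "rec":
        return ("rec", [(a + label, ty) for label, ty in t[1]])
    return ("union", [(tag, rlabel(a, r)) for tag, r in t[1]])


def ulabel(tag, r):
    if r[0] == "rec":
        return ("union", [(tag, r)])
    return ("union", [(tag + inner, rec) for inner, rec in r[1]])


def norm(t):
    if isinstance(t, str):
        return t
    if t[0] == "set":
        return ("set", norm(t[1]))
    if t[0] == "rec":
        return reduce(otimes, [rlabel(a, norm(ty)) for a, ty in t[1]])
    return reduce(oplus, [ulabel(tag, norm(r)) for tag, r in t[1]])


# The extended type from the example above.
A = ("union", [("T1", ("rec", [("a", "t11"), ("b", "t12")])),
               ("T2", ("rec", [("c", "t13"), ("d", "t14")]))])
B = ("union", [("T3", ("rec", [("e", "t21"), ("f", "t22")])),
               ("T4", ("rec", [("g", "t23"), ("h", "t24")]))])
extended = ("union", [("T", ("rec", [("A", A), ("B", B)]))])
normalized = norm(extended)
```

Under this encoding the four resulting tags are TT1T3, TT1T4, TT2T3 and TT2T4, matching the worked example.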
2.8 Join 

Join takes a tuple from each input table and tests this combination to see if it meets a predicate; 
if so, it returns a combined tuple as the result. Join allows specification of a different 
predicate and a different output tag for each pairwise combination of types from the 
two tables. Not all pairs must have a predicate and output tag: those that do not have one do 
not contribute to the join. 

An "inner" join is described in the following. 

Defn. Conditional predicate on two tuples: 

The Join and Nest operations need conditions on two records: 
c : r1 x r2 -> bool 

Defn. Combined tuple: 

Join produces a combined tuple as its output. Recall the "⊗" operator, which 
returns a new combined tuple: 

r1 ⊗ r2 = [a1_1:t1_1, ..., a1_n:t1_n, a2_1:t2_1, ..., a2_m:t2_m] where 
r1 = [a1_1:t1_1, ..., a1_n:t1_n], r2 = [a2_1:t2_1, ..., a2_m:t2_m] 

(There is an implicit condition here that the labels in the two records are disjoint.) 

Defn. j-former: 

A j-former of type: 

j :: <tag1_1:r1_1 | ... | tag1_n:r1_n>, <tag2_1:r2_1 | ... | tag2_m:r2_m> -> 
<tag'_1 : r1_{i_1} ⊗ r2_{j_1} | ... | tag'_p : r1_{i_p} ⊗ r2_{j_p}> 

is an expression of the form: 

j = <tag1_{i_1}, tag2_{j_1} / c_1 -> tag'_1 | ... | tag1_{i_p}, tag2_{j_p} / c_p -> tag'_p> 

where: 

1. i_1, ..., i_p ∈ {1, ..., n}, j_1, ..., j_p ∈ {1, ..., m} such that all pairs (i_1, j_1), ..., 
(i_p, j_p) are distinct (hence p <= mn), and all tags tag'_1, ..., tag'_p are distinct 

2. c_k : r1_{i_k} x r2_{j_k} -> bool, for k = 1, ..., p. 

Defn. Join operator: 

Let j :: t1 x t2 -> t be a j-former. Then the join operator is: 
⋈j({t1}, {t2}) = {t} 



2.8.1.1.1.1 Example 

The following NCRs are used to illustrate a join of two tables: 

2.8.1.1.1.1.1 Depts 

Dept: [dname: Payroll, loc: HQ] 
Dept: [dname: Quality Ctrl, loc: HQ] 
Dept: [dname: Sales, loc: HQ] 
Dept: [dname: Personnel, loc: Satellite] 

2.8.1.1.1.1.2 EmpSites 

Emp: [ename: Bob, ssn: 123-45-6789, dept: Payroll] 
Emp: [ename: Marilyn, ssn: 321-54-9876, dept: Quality Ctrl] 
Site: [sname: HQ, city: San Jose] 
Emp: [ename: Qing, ssn: 673-82-3845, dept: Quality Ctrl] 
Emp: [ename: Betsy, ssn: 233-23-6352, dept: Sales] 
Emp: [ename: Brian, ssn: 341-69-0323, dept: Sales] 
Emp: [ename: Sam, ssn: 356-02-6743, dept: Payroll] 



A join on the two tables can be performed with one join predicate for (Emp x Dept) and 
another for (Site x Dept): 

⋈<Emp,Dept/($1.dept=$2.dname)->EmpLoc | Site,Dept/($1.sname=$2.loc)->FullDept> (EmpSites, Depts) 

returns the following NCR: 



EmpLoc: [ename: Bob, ssn: 123-45-6789, dept: Payroll, dname: Payroll, loc: HQ] 
EmpLoc: [ename: Marilyn, ssn: 321-54-9876, dept: Quality Ctrl, dname: Quality Ctrl, loc: HQ] 
FullDept: [sname: HQ, city: San Jose, dname: Payroll, loc: HQ] 
FullDept: [sname: HQ, city: San Jose, dname: Quality Ctrl, loc: HQ] 
FullDept: [sname: HQ, city: San Jose, dname: Sales, loc: HQ] 
EmpLoc: [ename: Qing, ssn: 673-82-3845, dept: Quality Ctrl, dname: Quality Ctrl, loc: HQ] 
EmpLoc: [ename: Betsy, ssn: 233-23-6352, dept: Sales, dname: Sales, loc: HQ] 
EmpLoc: [ename: Brian, ssn: 341-69-0323, dept: Sales, dname: Sales, loc: HQ] 
EmpLoc: [ename: Sam, ssn: 356-02-6743, dept: Payroll, dname: Payroll, loc: HQ] 
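
The join of this example can be reproduced with a short Python sketch. The representation is an assumption made for illustration: an NCR is a list of (tag, record) pairs and a j-former is a dict from a pair of input tags to a (condition, output tag) entry. A nested-loop strategy is used only for clarity; nothing here says how the engine actually evaluates joins.

```python
def join(left, right, j_former):
    """Nested-loop join driven by a j-former keyed on (left tag, right tag)."""
    out = []
    for ltag, lrec in left:
        for rtag, rrec in right:
            rule = j_former.get((ltag, rtag))
            if rule is None:
                continue  # tag pairs without an entry do not contribute
            cond, out_tag = rule
            if cond(lrec, rrec):
                # Combined tuple; the labels are assumed to be disjoint.
                out.append((out_tag, {**lrec, **rrec}))
    return out


depts = [("Dept", {"dname": d, "loc": l}) for d, l in [
    ("Payroll", "HQ"), ("Quality Ctrl", "HQ"),
    ("Sales", "HQ"), ("Personnel", "Satellite")]]
emp_sites = [
    ("Emp", {"ename": "Bob", "ssn": "123-45-6789", "dept": "Payroll"}),
    ("Emp", {"ename": "Marilyn", "ssn": "321-54-9876", "dept": "Quality Ctrl"}),
    ("Site", {"sname": "HQ", "city": "San Jose"}),
    ("Emp", {"ename": "Qing", "ssn": "673-82-3845", "dept": "Quality Ctrl"}),
    ("Emp", {"ename": "Betsy", "ssn": "233-23-6352", "dept": "Sales"}),
    ("Emp", {"ename": "Brian", "ssn": "341-69-0323", "dept": "Sales"}),
    ("Emp", {"ename": "Sam", "ssn": "356-02-6743", "dept": "Payroll"}),
]
j = {
    ("Emp", "Dept"): (lambda e, d: e["dept"] == d["dname"], "EmpLoc"),
    ("Site", "Dept"): (lambda s, d: s["sname"] == d["loc"], "FullDept"),
}
joined = join(emp_sites, depts, j)
```

The Personnel department matches neither an employee nor a site, so, as expected for an inner join, it does not appear in the result.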



2.9 Outer Join 

An outer join has the same syntax as an inner join, with minor additional restrictions. 
The semantics differs: 

Defn. oj-former: 

Let t1 = <tag1_1:r1_1 | ... | tag1_n:r1_n> and t2 = <tag2_1:r2_1 | ... | tag2_m:r2_m>. An 
oj-former of type: 

oj :: t1, t2 -> t1 ⊕ t2 ⊕ <tag'_1 : r1_{i_1} ⊗ r2_{j_1} | ... | tag'_p : r1_{i_p} ⊗ r2_{j_p}> 

is an expression of the form: 

oj = <tag1_{i_1}, tag2_{j_1} / c_1 -> tag'_1 | ... | tag1_{i_p}, tag2_{j_p} / c_p -> tag'_p> 

where: 

1. i_1, ..., i_p ∈ {1, ..., n}, j_1, ..., j_p ∈ {1, ..., m} such that all pairs (i_1, j_1), ..., 
(i_p, j_p) are distinct (hence p <= mn), and all tags tag'_1, ..., tag'_p are distinct 
2. c_k : r1_{i_k} x r2_{j_k} -> bool, for k = 1, ..., p. 
3. all tags tag'_1, ..., tag'_p are distinct, and they are also distinct both from 
tag1_1, ..., tag1_n and from tag2_1, ..., tag2_m 

Defn. Outer Join operator: 

Let oj :: t1 x t2 -> t be an oj-former. The outer-join operator is: 
⟗oj({t1}, {t2}) = {t} 

Example 

The outer join is illustrated by a data integration scenario. There are two sources of 
persons' phone numbers, where each source has an attribute that tells us the confidence in 
that piece of information: 

Source1: {<source1:[name1:string, phone1:int, conf1:real]>} 
Source2: {<source2:[name2:string, phone2:int, conf2:real]>} 

The integration is done in two steps. First, the outer join is computed; then a 
selection/projection that encapsulates the logic of the integration is applied. The first step 
results in the raw data: 

rawIntegratedData = Source1 ⟗oj Source2 

where the oj-former is: 

oj = <source1, source2/($1.name1=$2.name2) -> integrated> 

The type of rawIntegratedData is: 

rawIntegratedData: {<source1:[name1, phone1, conf1] 
| source2:[name2, phone2, conf2] 
| integrated:[name1, name2, phone1, phone2, conf1, conf2]>} 

That is, it will include all "integrated" records, as well as "dangling tuples" from each 
source. The second step can now apply an arbitrarily sophisticated integration logic. For 
example, it may trust source 1 more than source 2, apply some complex rules on how 
to deal with conflicting information, and drop records where the confidence is too low: 

integratedData = Σp(rawIntegratedData) 

where the p-former p encapsulates the integration logic: 

p = <source1/(conf1 > 0.1) -> certain:[name1 -> name, phone1 -> phone] 
| source2/(conf2 > 0.5) -> certain:[name2 -> name, phone2 -> phone] 
| integrated/(conf1 > 0.4) -> certain:[name1 -> name, phone1 -> phone] 
| integrated/(conf1 < 0.4 and conf2 > 0.7) -> 
certain:[name1 -> name, phone2 -> phone] 
| integrated/(conf1 in 0.1..0.4 and conf2 in 0.5..0.7) -> 
uncertain:[name1 -> name, phone1, phone2]> 

The type of integratedData is: 

integratedData: {<certain:[name, phone] | uncertain:[name, phone1, phone2]>} 

that is, in some cases both phone numbers are kept since the confidence does not favor 
one over the other. 

2.10 Nest 

The Nest operator works like a left outer join, but it nests all matching children within the 
tuple of each parent. Nesting is commonly done in XML-QL subqueries, and any time there 
is a 1:n parent-child hierarchy in the output. Nest has left outer join semantics, rather than 
inner join semantics, and it preserves the order of the parent relation. Nest can rename 
both the tag types for the nested tuples and for the parent tuples. 

Defn. n-former: 

Given n x m predicates: 

c_ij : r1_i x r2_j -> bool 
and n new labels, b_1, ..., b_n, an n-former of type: 

n :: <tag1_1:r1_1 | ... | tag1_n:r1_n>, <tag2_1:r2_1 | ... | tag2_m:r2_m> -> 
<tag1_1 : r1_1 ⊗ [b_1:{u2}] | ... | tag1_n : r1_n ⊗ [b_n:{u2}]> 
is given by an expression: 

n = <tag1_i : [b_i : <tag2_1/c_i1 | ... | tag2_m/c_im>] | i = 1, ..., n> 

Defn. Nest operator: 

Let n :: u1, u2 -> u be an n-former. Then 
Nestn({u1}, {u2}) = {u} 

Example. This example uses the same Depts and EmpSites relations, of types: 
Depts: {<Dept:[dname, loc]>} 
EmpSites: {<Emp:[ename, ssn, dept] | Site:[sname, city]>} 
and performs the operation: 

Nest<Dept:[info:<Emp/($1.dname=$2.dept) | Site/($1.loc=$2.sname)>]>(Depts, EmpSites) 

The result has type: 

{<Dept:[dname, loc, info:{<Emp:[ename, ssn, dept] | Site:[sname, city]>}]>} 
and is depicted below: 



Dept: [dname: Payroll, loc: HQ, info: { 
    Emp: [ename: Bob, ssn: 123-45-6789, dept: Payroll] 
    Site: [sname: HQ, city: San Jose] 
    Emp: [ename: Sam, ssn: 356-02-6743, dept: Payroll] }] 
Dept: [dname: Quality Ctrl, loc: HQ, info: { 
    Emp: [ename: Marilyn, ssn: 321-54-9876, dept: Quality Ctrl] 
    Site: [sname: HQ, city: San Jose] 
    Emp: [ename: Qing, ssn: 673-82-3845, dept: Quality Ctrl] }] 
Dept: [dname: Sales, loc: HQ, info: { 
    Site: [sname: HQ, city: San Jose] 
    Emp: [ename: Betsy, ssn: 233-23-6352, dept: Sales] 
    Emp: [ename: Brian, ssn: 341-69-0323, dept: Sales] }] 
Dept: [dname: Personnel, loc: Satellite, info: { }] 
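
The left-outer-join flavour of Nest can be sketched in Python. As before, this is a hypothetical model in which an NCR is a list of (tag, record) pairs; the function name nest and its parameters are assumptions for the illustration, and only a reduced version of the sample data is used.

```python
def nest(parents, children, label, conditions):
    """Attach to every parent the list of matching children under `label`.
    `conditions` maps a child tag to a predicate(parent_record, child_record)."""
    out = []
    for ptag, prec in parents:  # the order of the parent relation is preserved
        matches = [
            (ctag, crec)
            for ctag, crec in children
            if ctag in conditions and conditions[ctag](prec, crec)
        ]
        # Parents without matches are kept, with an empty inner relation.
        out.append((ptag, {**prec, label: matches}))
    return out


depts = [
    ("Dept", {"dname": "Payroll", "loc": "HQ"}),
    ("Dept", {"dname": "Personnel", "loc": "Satellite"}),
]
emp_sites = [
    ("Emp", {"ename": "Bob", "ssn": "123-45-6789", "dept": "Payroll"}),
    ("Site", {"sname": "HQ", "city": "San Jose"}),
    ("Emp", {"ename": "Qing", "ssn": "673-82-3845", "dept": "Quality Ctrl"}),
]
nested = nest(depts, emp_sites, "info", {
    "Emp": lambda d, e: d["dname"] == e["dept"],
    "Site": lambda d, s: d["loc"] == s["sname"],
})
```

The Personnel department survives with an empty info relation, which is the difference from the inner join shown earlier.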



2.11 Union 

The union of two union types, u1 ⊕ u2, is defined to consist of all tags in both types, provided 
that a tag occurring in both u1 and u2 has identical types in u1 and u2. 

Defn. Union operator: 

∪({<r_i1, r_i2, ..., r_in>}, {<r_j1, r_j2, ..., r_jm>}) = 
{<r_i1, r_i2, ..., r_in> ⊕ <r_j1, r_j2, ..., r_jm>} 
Example 

The union operator is illustrated using the Depts and EmpSites tables from the 
join example. The operation: 
∪(Depts, EmpSites) 
results in the NCR: 



Dept: [dname: Payroll, loc: HQ] 
Dept: [dname: Quality Ctrl, loc: HQ] 
Dept: [dname: Sales, loc: HQ] 
Dept: [dname: Personnel, loc: Satellite] 
Emp: [ename: Bob, ssn: 123-45-6789, dept: Payroll] 
Emp: [ename: Marilyn, ssn: 321-54-9876, dept: Quality Ctrl] 
Site: [sname: HQ, city: San Jose] 
Emp: [ename: Qing, ssn: 673-82-3845, dept: Quality Ctrl] 
Emp: [ename: Betsy, ssn: 233-23-6352, dept: Sales] 
Emp: [ename: Brian, ssn: 341-69-0323, dept: Sales] 
Emp: [ename: Sam, ssn: 356-02-6743, dept: Payroll] 



The expression A ∪ B, for various types of A and B, is illustrated in the following: 
1. If A:{<Emp:[name, phone]>}, B:{<Dept:[name, floor]>}, then A ∪ B has type 
{<Emp:[name, phone] | Dept:[name, floor]>} and denotes their disjoint union 
(i.e., no duplicates are introduced, and duplicate elimination is not needed). 
2. If A:{<Emp:[name, phone] | Mngr:[name, beeper]>} and 
B:{<Dept:[name, floor] | Mngr:[name, beeper]>}, then A ∪ B has type 
{<Emp:[name, phone] | Mngr:[name, beeper] | Dept:[name, floor]>} and 
means: take the disjoint union of Emps from A and Depts from B, and take the 
regular union of Mngrs from both A and B. The output type is: 
{<Emp:[name, phone] | Mngr:[name, beeper] | Dept:[name, floor]>}. We 
need to do duplicate elimination on Mngrs. If the type of Mngr in B is changed, 
such that the types of Mngr in A and B do not coincide any more, then A ∪ B is 
illegal. 

3. If both A and B have the same type, say {<Emp:[name, phone] | Mngr:[name, 
beeper]>}, then A ∪ B is the regular union and the output type is the same. 

2.11 Distinct 

The distinct operator removes duplicates: 

distinct: {t} -> {t} 

2.12 Aggregates 

Aggregate operators are included in the combo operator. The five aggregates in SQL 
are: count, sum, min, max, avg. Count is treated slightly differently. Let agg be one of 
sum, min, max, avg, and let b be the base type to which it applies (b can be int, real, or 
string when agg is min or max, int or real when agg is sum, and real only when agg is 
avg). Consider a union type: 

u = <tag1:r1 | ... | tagn:rn> 

and let e1 : r1 -> b, ..., en : rn -> b be expressions. Then the following is an expression: 
agg<tag1/e1 | ... | tagn/en> : {u} -> b 
Expressions can be used in combo operators. 

Example. The following example uses the NCR: 
Products: {<Indigenous:[name, quantity, category] | Imported:[n, q, c]>} 

where n, q, c also stand for name, quantity, category. The computation of the total 
quantities for all categories is performed in three steps. First, all categories are 
computed: 

Cat = distinct(Σ<Indigenous -> Cat:[category -> c] | Imported -> Cat:[c]>(Products)) 

The type is {<Cat:[c]>}. Second, the products are nested by categories: 

Groups = Nest<Cat:[Prods:<Indigenous/($1.c=$2.category) | Imported/($1.c=$2.c)>]>(Cat, Products) 

The type is {<Cat:[c, Prods:{<Indigenous:[...] | Imported:[...]>}]>}. 
Third, the sum of all quantities is computed: 
Answer = Σ<Cat:[c, total:sum<Indigenous/quantity | Imported/q>(Prods)]>(Groups) 

The type is {<Cat:[c, total]>}. 
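
The three steps (distinct, nest, aggregate) can be followed in a small Python sketch. The sample products and the helper names category_of and quantity_of are hypothetical; they only stand in for the per-tag expressions of the p-formers above.

```python
products = [
    ("Indigenous", {"name": "apple", "quantity": 10, "category": "fruit"}),
    ("Imported", {"n": "mango", "q": 5, "c": "fruit"}),
    ("Indigenous", {"name": "oak", "quantity": 3, "category": "wood"}),
]


def category_of(tag, rec):
    """Per-tag expression for the category label."""
    return rec["category"] if tag == "Indigenous" else rec["c"]


def quantity_of(tag, rec):
    """Per-tag expression for the quantity label."""
    return rec["quantity"] if tag == "Indigenous" else rec["q"]


# Step 1: Cat = distinct categories, in order of first appearance.
cat = []
for tag, rec in products:
    c = category_of(tag, rec)
    if c not in cat:
        cat.append(c)

# Step 2: Groups = products nested under their category.
groups = [
    {"c": c, "Prods": [(t, r) for t, r in products if category_of(t, r) == c]}
    for c in cat
]

# Step 3: Answer = one total per category, summing the per-tag quantities.
answer = [
    {"c": g["c"], "total": sum(quantity_of(t, r) for t, r in g["Prods"])}
    for g in groups
]
```

The sum ranges over a heterogeneous inner relation, so it needs one expression per tag, which mirrors the agg<tag1/e1 | ... | tagn/en> form.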

5 

3 NCR-QL 

The following illustrates how XML-QL could be mapped into the algebra. However, 
XML-QL works over XML data, not NCRs, so an NCR version of XML-QL ("NCR-QL") 
is defined that works on NCRs. 

The analogy between XML-QL and NCR-QL is the following: 

XML-QL : where-construct 
NCR-QL : from-case-where-construct 

3.1 Query 1 

EmpsDeptsSites is the relation defined earlier containing tuples about Employees, Departments, 
and Sites. HOwnersCities is a relation containing tuples about HomeOwners and Cities. The 
following NCR-QL query performs a join between the two: 

From EmpsDeptsSites, HOwnersCities 
Case (Emp:[name:$X, ssn:$Y, sal:$Z, phone:$U], HOwner:[lastname:$V, zip:$W]) : 
(Where $X=$V AND $Z > 100000 
Construct EHO:[name:$X, ssn:$Y, zip:$W]) 
| (Dept:[name:$X, loc:$Y, mgr:$Z], HOwner:[lastname:$V, zip:$W]) : 
(Where $Z=$V 
Construct DHO:[name:$Z, dept:$X, zip:$W]) 
| (Dept:[name:$X, loc:$Y, mgr:$Z], City:[cityname:$V, place:$W]) : 
(Where $Y=$W 
Construct DC:[name:$X, city:$V]) 

All combinations of tags from EmpsDeptsSites (defined above) and HOwnersCities (some 
other NCR) listed in the Case statement are inspected, and in each case a different output tag is 
produced. The corresponding algebra expression is: 

Σp(EmpsDeptsSites ⋈j HOwnersCities) 
where: 

j = <Emp,HOwner/(name=lastname AND sal > 100000) -> EHO 
| Dept,HOwner/(mgr=lastname) -> DHO 
| Dept,City/(loc=place) -> DC> 

p = <EHO:[name, ssn, zip] | DHO:[mgr -> name, name -> dept, zip] | DC:[name, cityname -> city]> 

3.2 Query 2 

This query illustrates patterns over sets and how they are translated into the algebra. It 
roughly corresponds to the XML-QL pattern: 

<products> <product> <name> $n </> 
<orders> <order> <date> $d </> </> 
</product> 
</products> IN Products 

where <orders> is a set. The NCR-QL query is more powerful since it handles heterogeneous 
collections of products and orders. 

From Products 
Case (SeattleProduct:[name:$n, price:$p, orders:$x]): 
Construct 
(From $x 
Case (order:[customer:$c, date:$d]): Where $p<100 
Construct usProductDate:[name:$n, date:$d]) 
| (ParisProduct:[nome:$n, prix:$p, orders:$x]): 
Construct 
(From $x 
Case (euOrder:[country:$c, date:$d]): Where $p>35 
Construct euProductDate:[name:$n, date:$d] 
| (usOrder:[city:$c, date:$d]): Construct importProductDate:[name:$n, date:$d]) 

Here Products has type (base types omitted): 

{<SeattleProduct:[name, price, orders:{<order:[customer, date]>}] | 
ParisProduct:[nome, prix, orders:{<euOrder:[country, date] | 
importOrder:[city, date]>}]>} 

While XML-QL had a single pattern, nested patterns in NCR-QL are needed to match over 
nested sets. The NCR-QL query is translated into the algebra using a match operator: 

Σp(Ωm(Products)) 

where: 

m = <SeattleProduct:[name:_, price:_, orders:unnest{<order:[customer:_, date:_]>}] 
| ParisProduct:[nome:_, prix:_, orders:unnest{<euOrder:[country:_, date:_] 
| usOrder:[city:_, date:_]>}]> 

Note that Ωm(Products) has type: 
{<SeattleProduct:[name, price, orders:<order:[customer, date]>] | 
ParisProduct:[nome, prix, orders:<euOrder:[country, date] | 
importOrder:[city, date]>]>} 

which is the same type as for Products, with inner braces erased. It normalizes to: 

{<SeattleProduct.order:[name, price, orders.customer, orders.date] 
| ParisProduct.euOrder:[nome, prix, orders.country, orders.date] 
| ParisProduct.importOrder:[orders.city, orders.date]>} 

Hence, p is: 

<SeattleProduct.order/price<100 -> UsProductDate:[name -> name, orders.date -> date] 
| ParisProduct.euOrder/prix>35 -> EuProductDate:[name -> name, orders.date -> date] 
| ParisProduct.importOrder -> ImportProductDate:[name -> name, orders.date -> date]> 



3.3 Query 3 

The nesting in NCR-QL is illustrated using subqueries. 
From Products 
Case (product:[id:$x, name:$n, price:$p]): 
(Where $p<100 
Construct ProductWithOrders:[name:$n, 
orders: From Orders 
Case (order:[pid:$y, quantity:$q, date:$d]): 
Where $x=$y AND $q>5555 
Construct order:[date:$d] 
] 
) 

The types are (with base types omitted): 

Products: {<product:[id, name, price]>} Orders: {<order:[pid, quantity, date]>} 
The algebra expression equivalent to Query 3 is: 
Σp(Nestn(Products, Orders)) 

where n is: 

n = <product:[orders:<order/id=pid>]> 

and p is: 

p = <(product/price<100 -> ProductWithOrders):[name, orders:{<order/quantity>5555:[date]>}]> 
Notice that Nestn(Products, Orders) has type: 
Nestn(Products, Orders) : {<product:[id, name, price, orders:{<order:[pid, quantity, date]>}]>} 

3.4 General Form of Queries 

A general form of algebra expressions for NCR-QL (and, hence, XML-QL) queries is: 

Σp(Nestn1(Join1, Nestn2(Join2, ... Nestnk-1(Joink-1, Joink)...))) 
where: 

Join1 = Ωm11(R11) ⋈j11 Ωm12(R12) ⋈j12 ... 
Join2 = Ωm21(R21) ⋈j21 Ωm22(R22) ⋈j22 ... 
Joink = Ωmk1(Rk1) ⋈jk1 Ωmk2(Rk2) ⋈jk2 ... 

4 Algebraic Laws 

The following three kinds of algebraic laws are formed: 

• Push selections and projections down. Selections and projections are captured by 
the Combo operator, Σp, hence laws are needed that commute Combo with other 
operators. 
• Join reordering: associativity, commutativity. 
• Join-nest associativity. 

4.1 Laws that Push Combo (=Selections and Projections) Down 

4.1.1 The Combo-Combo Law 

Let p :: t1 -> t2, and q :: t2 -> t3 be two p-formers. Then: 

Σq(Σp(R)) = Σr(R) (the combo-combo law) 

Here r is a new p-former defined as r = ppcompose(q, p), by induction on q first, and, where 
needed, by induction on p second. 

/* q = identity */ 
ppcompose(_, p) = p 
/* q = record p-former */ 
ppcompose([(b_{i_1} -> c_1):q_1, ..., (b_{i_k} -> c_k):q_k, c_{k+1}:e_1, c_{k+2}:e_2, ..., c_r:e_{r-k}], _) = 
[(b_{i_1} -> c_1):q_1, ..., (b_{i_k} -> c_k):q_k, c_{k+1}:e_1, c_{k+2}:e_2, ..., c_r:e_{r-k}] 
ppcompose([(b_{i_1} -> c_1):q_1, ..., (b_{i_k} -> c_k):q_k, c_{k+1}:e_1, c_{k+2}:e_2, ..., c_r:e_{r-k}], 
[(a_{j_1} -> b_1):p_1, ..., (a_{j_s} -> b_s):p_s, b_{s+1}:f_1, b_{s+2}:f_2, ..., b_m:f_{m-s}]) = 
[(a_{j_{i_1}} -> c_1):ppcompose(q_1, p_{i_1}), ..., (a_{j_{i_t}} -> c_t):ppcompose(q_t, p_{i_t}), 
c_{t+1}:f_{i_{t+1}-s}, c_{t+2}:f_{i_{t+2}-s}, ..., c_k:f_{i_k-s}, 
c_{k+1}:e_1 o p, c_{k+2}:e_2 o p, ..., c_r:e_{r-k} o p] 
/* here {j_1, ..., j_m} ⊆ {1, 2, ..., n} and {i_1, ..., i_k} ⊆ {1, 2, ..., m}, with i_1 < i_2 
< ... < i_t <= s < i_{t+1} < ... < i_k <= m, 
p is the second p-former, p = [(a_{j_1} -> b_1):p_1, ..., (a_{j_s} -> b_s):p_s, 
b_{s+1}:f_1, b_{s+2}:f_2, ..., b_m:f_{m-s}], 
and e o p means "apply the p-former p first, then e" and is defined below */ 

/* q = variant p-former */ 
ppcompose((tag'/c' -> tag''):q, _) = (tag'/c' -> tag''):q 
ppcompose((tag'/c' -> tag''):q, (tag/c -> tag'):p) = 
(tag/(c and c' o p) -> tag''):ppcompose(q, p) 
/* here c' o p is defined below and means: apply p first, then check c' */ 

/* q = union p-former */ 
ppcompose(<q_1 | ... | q_k>, _) = <q_1 | ... | q_k> 
ppcompose(<q_1 | ... | q_k>, <p_1 | ... | p_m>) = <... | ppcompose(q_i, p_j) | ...> 
/* here each q_i is paired with all those p_j that have an output tag that matches 
the input tag in q_i: according to the definitions there could be one or more */ 
/* q = set p-former */ 
ppcompose({q}, _) = {q} 
ppcompose({q}, {p}) = {ppcompose(q, p)} 

20 

Composition of a p-former with an expression, e o p, and of a p-former with a condition, 
c o p, are defined. In both cases, the expression (e) and the condition (c) have a single 
record argument (i.e., only use $1, which, by convention, may be omitted), while p is a 
record p-former, p = [(a_{i_1} -> b_1):p_{i_1}, ..., (a_{i_k} -> b_k):p_{i_k}, c_1:e_1, ..., c_p:e_p]. 

epcompose($1.b_j, p) = $1.a_{i_j} 
epcompose($1.c_j, p) = e_j 
epcompose(e op e', p) = epcompose(e, p) op epcompose(e', p) where op is +, -, *, /, ... 
epcompose(f(e_1, e_2, ...), p) = f(epcompose(e_1, p), epcompose(e_2, p), ...) 
cpcompose(e op e', p) = epcompose(e, p) op epcompose(e', p) where op is <, >, <=, 
>=, ... 
cpcompose(<tag:[a_1:e_1, ..., a_n:e_n]> IN e, p) = <tag:[a_1:epcompose(e_1, p), ..., a_n: 
epcompose(e_n, p)]> IN epcompose(e, p) 
cpcompose(exists(e), p) = exists(epcompose(e, p)) 
cpcompose(true, p) = true, cpcompose(false, p) = false 
cpcompose(c_1 and c_2, p) = cpcompose(c_1, p) and cpcompose(c_2, p); same for or, not 

Example. Let p = <Emp/(sal>50000) -> HighEarner:[name -> richName:_] | 
Dept/(mgr="Betsy") -> BetsyManages:[name -> name:_] | 
Site/(true) -> Site:[name -> name:_, city -> city:_]> 
Define R = Σp(EmpsDeptsSites). 
Let q = <HighEarner/(like(richName,"%Smith%")) -> HighEarner:[richName -> name:_] | 
BetsyManages/true -> Betsy:[name -> name:_] | 
Site/(city="Seattle") -> HighEarner:[name -> name:_]> 

And define R' = Σq(R) = Σq(Σp(EmpsDeptsSites)). 
Then R' = Σppcompose(q,p)(EmpsDeptsSites), where: 
ppcompose(q,p) = 
<Emp/(sal>50000 and like(name,"%Smith%")) -> HighEarner:[name -> name:_] | 
Dept/(mgr="Betsy" and true) -> Betsy:[name -> name:_] | 
Site/(true and city="Seattle") -> HighEarner:[name -> name:_]> 
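
The combo-combo law can be checked on sample data with a Python sketch restricted to flat p-formers (one level of tags, conditions and label renamings). The encoding is an assumption for this illustration: a p-former is a list of (input tag, condition, output tag, renaming) entries, where the renaming maps an input label to an output label. The sample records are hypothetical, and a substring test stands in for like with the pattern "%Smith%".

```python
def combo(records, former):
    out = []
    for tag, rec in records:
        for in_tag, cond, out_tag, rename in former:
            if tag == in_tag and cond(rec):
                out.append((out_tag, {new: rec[old] for old, new in rename.items()}))
    return out


def ppcompose(q, p):
    """Compose flat p-formers so that combo(R, r) equals combo(combo(R, p), q)."""
    r = []
    for q_in, q_cond, q_out, q_ren in q:
        for p_in, p_cond, p_out, p_ren in p:
            if p_out != q_in:
                continue  # pair q only with branches of p that feed its input tag

            def apply_p(rec, ren=p_ren):
                return {new: rec[old] for old, new in ren.items()}

            # "c and c' o p": check p's condition, then q's on p's output.
            def cond(rec, pc=p_cond, qc=q_cond, ap=apply_p):
                return pc(rec) and qc(ap(rec))

            # Express q's renaming over the labels of p's input.
            back = {new: old for old, new in p_ren.items()}
            ren = {back[mid]: new for mid, new in q_ren.items()}
            r.append((p_in, cond, q_out, ren))
    return r


p = [
    ("Emp", lambda r: r["sal"] > 50000, "HighEarner", {"name": "richName"}),
    ("Dept", lambda r: r["mgr"] == "Betsy", "BetsyManages", {"name": "name"}),
    ("Site", lambda r: True, "Site", {"name": "name", "city": "city"}),
]
q = [
    ("HighEarner", lambda r: "Smith" in r["richName"], "HighEarner", {"richName": "name"}),
    ("BetsyManages", lambda r: True, "Betsy", {"name": "name"}),
    ("Site", lambda r: r["city"] == "Seattle", "HighEarner", {"name": "name"}),
]
R = [
    ("Emp", {"name": "Al Smith", "sal": 60000}),
    ("Emp", {"name": "Bo Jones", "sal": 90000}),
    ("Emp", {"name": "Cy Smith", "sal": 10000}),
    ("Dept", {"name": "Sales", "mgr": "Betsy"}),
    ("Dept", {"name": "Payroll", "mgr": "Sam"}),
    ("Site", {"name": "HQ", "city": "San Jose"}),
    ("Site", {"name": "West", "city": "Seattle"}),
]
two_pass = combo(combo(R, p), q)
one_pass = combo(R, ppcompose(q, p))
```

Both evaluations yield the same three records here, while the composed form scans the input only once.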



4.1.2 Applications of tlie combo-combo law 

1. Commuting order of selections. Consider an example with simple combo
operators:

Σ°<Emp/(sal=20000)>(Σ°<Dept/(mgr="Smith")>(R)) =
= Σ°<Dept/(mgr="Smith")>(Σ°<Emp/(sal=20000)>(R))

To prove this, we expand the simple combos into combos, then apply the combo-combo
rule. The left hand side becomes:

Σ°<Emp/(sal=20000)>(Σ°<Dept/(mgr="Smith")>(R)) =
= Σ<Emp/(sal=20000) | Dept:_ | Other:_>(Σ<Emp:_ | Dept/(mgr="Smith") | Other:_>(R))
= Σ<Emp/(sal=20000) | Dept/(mgr="Smith") | Other:_>(R)

The right hand side is similarly:

Σ°<Dept/(mgr="Smith")>(Σ°<Emp/(sal=20000)>(R)) =
= Σ<Emp:_ | Dept/(mgr="Smith") | Other:_>(Σ<Emp/(sal=20000) | Dept:_ | Other:_>(R))
= Σ<Emp/(sal=20000) | Dept/(mgr="Smith") | Other:_>(R)

Hence the two are equal.
2. Commuting inner selections.

Σ°<Dept:[project:{<urgent/(deadline>="10/10/2010")>}]>(Σ°<Dept:[project:{<urgent/(budget<=10000)>}]>(R)) =
= Σ°<Dept:[project:{<urgent/(budget<=10000)>}]>(Σ°<Dept:[project:{<urgent/(deadline>="10/10/2010")>}]>(R))

Indeed, take the left hand side and expand the simple combos into combos, then
apply the combo-combo law:

Σ°<Dept:[project:{<urgent/(deadline>="10/10/2010")>}]>(Σ°<Dept:[project:{<urgent/(budget<=10000)>}]>(R)) =
= Σ<Emp:_ | Dept:[name:_, mgr:_, project:{<urgent/(deadline>="10/10/2010") | normal:_>}]>(
      Σ<Emp:_ | Dept:[name:_, mgr:_, project:{<urgent/(budget<=10000) | normal:_>}]>(R)) =
= Σ<Emp:_ | Dept:[name:_, mgr:_, project:{<urgent/(deadline>="10/10/2010" and budget<=10000) |
      normal:_>}]>(R)

The right hand side is treated similarly and results in the same expression, hence
they are equal.

3. Pushing predicates to sources. Consider a selection on the predicate (name like
"Smith%"). The source supports some predicates, but not this one. Suppose it
supports the predicate (name like "%Smith%"). The predicate is:

c = (name like "Smith%")

the source predicate is:

cs = (name like "%Smith%")

The optimizer knows the following implication:

c => cs

which is equivalent to:

c = c and cs

Hence it can perform the following optimization:

Σ°<Emp/(name like "Smith%")>(R) = Σ°<Emp/(c)>(R) = Σ°<Emp/(c and cs)>(R)
= Σ°<Emp/(c)>(Σ°<Emp/(cs)>(R))
= Σ°<Emp/(name like "Smith%")>(Σ°<Emp/(name like "%Smith%")>(R))

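As a minimal sketch of the predicate pushdown in item 3, assuming a source that can evaluate only the weaker predicate cs, the following Python fragment applies cs first and the original predicate c as a residual filter. The select helper and the sample employee names are hypothetical.

```python
def select(pred, rows):
    """Simple selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def c(row):
    # name like "Smith%": the predicate requested by the query
    return row["name"].startswith("Smith")


def cs(row):
    # name like "%Smith%": the weaker predicate the assumed source supports
    return "Smith" in row["name"]


# Hypothetical Emp records.
emps = [
    {"name": "Smith, Ann"},
    {"name": "Al Smith"},
    {"name": "Smithers, Cy"},
    {"name": "Lee, Di"},
]

direct = select(c, emps)         # evaluate c on every row
at_source = select(cs, emps)     # evaluated by the source, fewer rows transferred
pushed = select(c, at_source)    # residual filter applied by the engine
```

Because c implies cs, the residual filter removes only rows that c would have rejected anyway, so pushed and direct contain the same rows.
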


4.1.3 The Combo-Nest Law 

Let n :: u1, u2 -> u be an n-former and p :: u -> u' be a p-former. Then the following
holds:

Σ_p(Nest_n(R, S)) = Σ_{p3}(Nest_{n'}(Σ_{p1}(R), Σ_{p2}(S)))    (generic combo-nest law)

where p1, p2, p3 are p-formers and n' is an n-former to be described next. The following
can be chosen: p3 = p, n' = n, and p1 = p2 = _, and the identity holds. But p1 and p2 are
chosen to do as much as possible of the work that p does.
Let u1 = <tag_{11}:r_{11} | ... | tag_{1n}:r_{1n}>, u2 = <tag_{21}:r_{21} | ... | tag_{2m}:r_{2m}>.
The n-former n has the form:

n = <tag_{1i}:[b_i:<tag_{21}/c_{i1} | ... | tag_{2m}/c_{im}>] | i = 1, ..., n>

The p-former p has the form:

p = <p_1 | ... | p_k>

where: p_i = tag_{1j_i} / c_i' -> tag_{3i}:q_i, c_i' : r_{1j_i} ⊗ [b_{j_i}:{u2}] -> bool, i = 1, ..., k,

where {j_1, ..., j_k} ⊆ {1, 2, ..., n}



Combo-nest law 1: Assume that, for some i = 1, ..., k, the predicate c_i' ignores the nested part
(formally: $1.b_{j_i} does not occur in c_i'), and that q_i is the identity p-former. That is:

p_i = tag_{1j_i} / c_i' -> tag_{3i}:_

Define:

p_i' = tag_{1j_i} / c_i' -> tag_{1j_i}:_    /* i.e., tag_{1j_i} replaces tag_{3i} in p_i */
p1 = <p_i'>
p3 = <p_1 | ... | p_{i-1} | tag_{1j_i} -> tag_{3i}:_ | p_{i+1} | ... | p_k>

The combo-nest law is:

Σ_p(Nest_n(R, S)) = Σ_{p3}(Nest_n(Σ_{p1}(R), S))    (combo-nest law 1)

Here n' = n and p2 = _.
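
The effect of combo-nest law 1 can be illustrated, under simplifying assumptions, with the Python sketch below: a selection that inspects only the outer record is moved below the nest. The nest helper, the field names, and the sample departments and projects are hypothetical, and the tag renaming performed by p3 is left out.

```python
def nest(outer, inner, cond, field):
    """Attach to each outer record the list of inner records satisfying cond."""
    return [{**o, field: [i for i in inner if cond(o, i)]} for o in outer]


def select(pred, rows):
    """Simple selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


# Hypothetical data.
depts = [
    {"name": "Toys", "mgr": "Betsy"},
    {"name": "Tools", "mgr": "Carl"},
]
projects = [
    {"dept": "Toys", "title": "Kite"},
    {"dept": "Tools", "title": "Drill"},
    {"dept": "Toys", "title": "Yo-yo"},
]


def in_dept(dept, project):
    # nest condition, playing the role of the conditions c_{ij} of the n-former
    return project["dept"] == dept["name"]


def betsy(dept):
    # selection predicate that ignores the nested part
    return dept["mgr"] == "Betsy"


nest_then_select = select(betsy, nest(depts, projects, in_dept, "projects"))
select_then_nest = nest(select(betsy, depts), projects, in_dept, "projects")
```

Filtering first means the nested lists are built only for the departments that survive the selection.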

Combo-nest law 2: Assume that, for every i = 1, ..., k, the predicate c_i' only depends on its first
argument (i.e., it ignores the nested part, b_{j_i}), and that q_i leaves all fields unchanged except
for b_{j_i}, where it applies a selection, and that, moreover, that selection is the same for all
i = 1, ..., k. More precisely:

q_i = [..., b_{j_i} -> b_{j_i}:r]

where ... contains only identity p-formers, i.e., a -> a:_ for attributes a, and where:

r = <tag_{21}/c_{i1}'' | ... | tag_{2m}/c_{im}''>

is a selection that is the same for every i = 1, ..., k. Define:

p_i' = tag_{1j_i} / c_i' -> tag_{1j_i}:_ for i = 1, ..., k
    /* i.e., _ replaces q_i and tag_{1j_i} replaces tag_{3i} in p_i */
p1 = <p_1' | ... | p_k'>
p2 = r
p3 = <tag_{1j_1} -> tag_{31}:_ | ... | tag_{1j_k} -> tag_{3k}:_>



The combo-nest law is:

Σ_p(Nest_n(R, S)) = Σ_{p3}(Nest_n(Σ_{p1}(R), Σ_{p2}(S)))    (combo-nest law 2)


4.1.4 The Combo-Join Law



Let j :: u1, u2 -> u be a j-former and p :: u -> u' be a p-former. Then the following
holds:

Σ_p(R ⋈_j S) = Σ_{p3}(Σ_{p1}(R) ⋈_j Σ_{p2}(S))    (the combo-join law)

where p1, p2, p3 are defined below.
Notations:

j = <tag_{1i}, tag_{2j} / c_{ij} -> tag_{ij}' | i = 1, ..., n, j = 1, ..., m>
p = <tag_{ij}' / c_{ij}' -> tag_{ij}'' : q_{ij} | i = 1, ..., n, j = 1, ..., m>

Here c_{ij} : r_{1i} x r_{2j} -> bool is the join condition, while c_{ij}' : r_{1i} x r_{2j} -> bool is a selection
predicate. Not all combinations of tags tag_{ij}' must occur in p, but to simplify notations
they are included. Assume that c_{ij}'(x,y) = d_{ij}(x) AND e_{ij}(y) AND f_{ij}(x,y). That is, d_{ij}(x)
contains all conditions that inspect only values from the left join operand, e_{ij}(y)
those conditions that inspect only values from the right join operand, while f_{ij}(x,y)
contains conditions that inspect values from both operands and cannot be separated. The
conditions on the left operand are assumed to be independent of j, i.e.:

d_{i1}(x) = d_{i2}(x) = ... = d_{im}(x) = d_i(x) for i = 1, ..., n

and similarly:

e_{1j}(y) = e_{2j}(y) = ... = e_{nj}(y) = e_j(y) for j = 1, ..., m

This can often be achieved by manipulating boolean conditions, factoring out the
common parts and pushing the specific parts into f_{ij}(x,y). Then define:

p1 = <tag_{11} / d_1 -> tag_{11} | ... | tag_{1n} / d_n -> tag_{1n}>
p2 = <tag_{21} / e_1 -> tag_{21} | ... | tag_{2m} / e_m -> tag_{2m}>
p3 = <tag_{ij}' / f_{ij} -> tag_{ij}'' : q_{ij} | i = 1, ..., n, j = 1, ..., m>
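
As a hedged illustration of the combo-join law for a single pair of tags, the Python sketch below splits a selection predicate into a left-only part d, a right-only part e, and an inseparable part f, and pushes d and e below the join. The join helper, the field names, and the sample rows are assumptions made for this sketch; the tag renaming and the nested p-formers q_{ij} are omitted.

```python
def join(left, right, cond):
    """Nested-loop join returning the pairs (x, y) that satisfy cond."""
    return [(x, y) for x in left for y in right if cond(x, y)]


# Hypothetical operands.
emps = [
    {"name": "Ann", "dept": "Toys", "sal": 60000},
    {"name": "Bob", "dept": "Toys", "sal": 30000},
    {"name": "Cy", "dept": "Tools", "sal": 80000},
]
depts = [
    {"name": "Toys", "mgr": "Betsy", "budget": 50000},
    {"name": "Tools", "mgr": "Carl", "budget": 90000},
]


def join_cond(x, y):
    # join condition c_{ij}
    return x["dept"] == y["name"]


def d(x):
    # inspects only the left operand
    return x["sal"] > 50000


def e(y):
    # inspects only the right operand
    return y["mgr"] == "Betsy"


def f(x, y):
    # inspects both operands and cannot be separated
    return x["sal"] > y["budget"]


# Left hand side: join first, then apply the whole predicate d AND e AND f.
select_after_join = [(x, y) for x, y in join(emps, depts, join_cond)
                     if d(x) and e(y) and f(x, y)]

# Right hand side: p1 applies d to R, p2 applies e to S, p3 applies f after the join.
left_filtered = [x for x in emps if d(x)]
right_filtered = [y for y in depts if e(y)]
select_pushed = [(x, y) for x, y in join(left_filtered, right_filtered, join_cond)
                 if f(x, y)]
```

In the pushed form the join sees two employees and one department instead of three and two, while producing the same result.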




4.2 Laws that commute joins and nests 

4.2.1 Join Associativity and Commutativity

Join commutativity holds:

R ⋈_j S = S ⋈_j R    (join commutativity law)

Join associativity:

(R ⋈_{j1} S) ⋈_{j2} T = R ⋈_{j3} (S ⋈_{j4} T)    (join associativity law)

holds for various choices of the j-formers j3 and j4.

In R ⋈_{j1} S only a subset of all pairs of tags need to have join conditions. To simplify
notations, however, each join considers all pairs of tags. Hence:

j1 = <tag_{1i}, tag_{2j} / c_{1ij} -> tag_{4ij} | i = 1, ..., n, j = 1, ..., m>

j2 = <tag_{4ij}, tag_{3k} / c_{2ijk} -> tag_{ijk} | i = 1, ..., n, j = 1, ..., m, k = 1, ..., p>

The condition c_{2ijk} looks at three records: x from R, y from S, and z from T.
Decompose it into two pieces:

c_{2ijk}(x,y,z) = c_{3ijk}(x,y,z) AND c_{4jk}(y,z)

such that the part inspecting only y and z is the same for all i = 1, ..., n. That is, in order to
define c_{4jk}(y,z) we inspect each condition c_{21jk}(x,y,z), ..., c_{2njk}(x,y,z) and factor out what
all of them do in common with y and z. Then define:

j4 = <tag_{2j}, tag_{3k} / c_{4jk} -> tag_{5jk} | j = 1, ..., m, k = 1, ..., p>
j3 = <tag_{1i}, tag_{5jk} / c_{3ijk} -> tag_{ijk} | i = 1, ..., n, j = 1, ..., m, k = 1, ..., p>
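
The regrouping can be sketched in Python for a single combination of tags, with small integers standing in for records. The conditions c1, c3, and c4 below are arbitrary examples. The sketch also assumes that the outer join on the right hand side checks the original condition c1 together with c3, a detail the definitions above leave implicit.

```python
# Hypothetical operands; integers stand in for records.
R = [1, 2, 3]
S = [2, 3, 4]
T = [3, 4, 5]


def c1(x, y):
    # join condition of j1
    return x < y


def c3(x, y, z):
    # part of the j2 condition that needs all three records
    return x + y < z + 2


def c4(y, z):
    # part of the j2 condition that inspects only y and z
    return y < z


# (R join S) join T, with c2 = c3 AND c4 applied by the second join.
left_deep = [(x, y, z)
             for x in R for y in S if c1(x, y)
             for z in T if c3(x, y, z) and c4(y, z)]

# R join (S join T): c4 is evaluated by the inner join (j4),
# c1 and c3 by the outer join (j3, as assumed in this sketch).
s_join_t = [(y, z) for y in S for z in T if c4(y, z)]
right_deep = [(x, y, z)
              for x in R for (y, z) in s_join_t
              if c1(x, y) and c3(x, y, z)]
```

Both groupings enumerate the same triples; which one is cheaper depends on how selective c1 and c4 are.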




4.2.2 The Join-Nest Rule 

The following holds:

R ⋈_{j1} (Nest_{n1}(S, T)) = Nest_{n2}((R ⋈_{j2} S), T)    (join-nest law)

provided that j1 looks only at the S-component in Nest_{n1}(S, T). In that case, j2 is "the
same" as j1, except that it does not get to see the nested attribute, which j1 did not use
anyway. Similarly, n2 is "the same" as n1, except that now it gets to see an R
component, which it ignores.
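
A minimal Python sketch of the join-nest law follows, assuming flat records represented as dictionaries with distinct field names so that a joined record is simply the merge of its two components. The join and nest helpers and all sample data are hypothetical.

```python
def join(left, right, cond):
    """Nested-loop join; the joined record merges the fields of both operands."""
    return [{**x, **y} for x in left for y in right if cond(x, y)]


def nest(outer, inner, cond, field):
    """Attach to each outer record the list of inner records satisfying cond."""
    return [{**o, field: [i for i in inner if cond(o, i)]} for o in outer]


# Hypothetical operands R, S, and T.
customers = [{"cust": "Ann", "city": "Seattle"}, {"cust": "Bob", "city": "Portland"}]
sites = [{"site": "HQ", "site_city": "Seattle"}, {"site": "Lab", "site_city": "Portland"}]
staff = [
    {"emp": "Xia", "works_at": "HQ"},
    {"emp": "Yul", "works_at": "Lab"},
    {"emp": "Zoe", "works_at": "HQ"},
]


def same_city(r, s):
    # join condition: looks only at the S-component, never at the nested field
    return r["city"] == s["site_city"]


def works_at(s, t):
    # nest condition: looks only at S fields, so it ignores any R fields present
    return t["works_at"] == s["site"]


join_of_nest = join(customers, nest(sites, staff, works_at, "staff"), same_city)
nest_of_join = nest(join(customers, sites, same_city), staff, works_at, "staff")
```

The same two condition functions serve on both sides, which mirrors the statement that j2 and n2 are "the same" as j1 and n1 apart from the fields they are allowed to see.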

From the above description, it will be appreciated that although the specific
embodiments of the technology have been described for purposes of illustration,
various modifications may be made without deviating from the scope of the
invention. Accordingly, the invention is not limited except by the appended
claims.